{ "info": { "author": "MIT Data To AI Lab", "author_email": "dailabmit@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Artificial Intelligence" ], "description": "

\n\u201cSDGym\u201d\nAn open source project from Data to AI Lab at MIT.\n

\n\n\n[![Travis](https://travis-ci.org/DAI-Lab/SDGym.svg?branch=master)](https://travis-ci.org/DAI-Lab/SDGym)\n[![PyPi Shield](https://img.shields.io/pypi/v/sdgym.svg)](https://pypi.python.org/pypi/sdgym)\n\n\n\n# SDGym - Synthetic Data Gym\n\n- License: MIT\n- Documentation: https://DAI-Lab.github.io/SDGym/\n- Homepage: https://github.com/DAI-Lab/SDGym\n\n# Overview\n\nSynthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators\nfor non-temporal tabular data. SDGym is based on the paper [Modeling Tabular data using Conditional\nGAN](https://arxiv.org/abs/1907.00503), and the project is part of the [Data to AI\nLaboratory](https://dai.lids.mit.edu/) at MIT.\n\nBenchmarking a synthesizer is a process in which your synthesizer generates several synthetic\ndatasets. Each pair of real and synthetic data is then evaluated with multiple scores.\n\n## What is a synthesizer function?\n\nIn order to use **SDGym** you will need a synthesizer function.\nThis is a function that takes as input a numpy matrix with real data and outputs another numpy\nmatrix with the same shape filled with synthesized data.\n\nAlongside the real data, the function also receives some additional variables describing the\ncolumn contents, which means that its exact signature looks like this:\n\n\n```python\nimport numpy\n\ndef my_synthesizer_function(\n    real_data: numpy.ndarray,\n    categorical_columns: list,\n    ordinal_columns: list\n) -> numpy.ndarray:\n    ...  # must return the synthesized data as a numpy.ndarray\n```\n\nIf your synthesizer implements a different interface, you can wrap it in a function like this:\n\n```python\ndef my_synthesizer_function(real_data, categorical_columns, ordinal_columns):\n    # ...do all necessary steps here...\n    return synthesized_data\n```\n\nThis function should contain all the parameters and arguments needed to use your synthesizer,\nand it should call the synthesizer to generate the new synthesized data based on the real data\nthat is being passed.\n\n## What data should 
your synthesizer work with?\n\nAs mentioned in the previous section, the main input of **SDGym** is a synthesizer to be\nbenchmarked, which is expected to be a function whose main input and only output are tables of\ndata.\n\nThe inputs for your synthesizer function should be:\n\n* `real_data`: a 2D `numpy.ndarray` with the real data that your synthesizer will attempt to imitate.\n* `categorical_columns`: a `list` with the indexes of any columns that should be considered\n categorical regardless of their type.\n* `ordinal_columns`: a `list` with the indexes of any integer columns that should be treated as\n ordinal values.\n\nThe output should be a single 2D `numpy.ndarray` with the exact same shape as the `real_data`\nmatrix.\n\n## Benchmark datasets\n\nAll the datasets used for the benchmarking can be found inside the [sdgym S3 bucket](http://sdgym.s3.amazonaws.com/index.html)\nin the form of an `.npz` numpy matrix archive and a `.json` metadata file that contains information\nabout the dataset structure and its columns.\n\nIn order to load these datasets in the same format in which they will be passed to your synthesizer\nfunction, you can use the `sdgym.load_dataset` function, passing the name of the dataset to load.\n\nIn this example, we will load the `adult` dataset:\n\n```python\nfrom sdgym import load_dataset\n\ndata, categorical_columns, ordinal_columns = load_dataset('adult')\n```\n\nThis will return a numpy matrix with the data that will be passed to your synthesizer function,\nas well as the lists of indexes for the categorical and ordinal columns:\n\n```python\n>>> data\narray([[2.70000e+01, 0.00000e+00, 1.77119e+05, ..., 4.40000e+01,\n 0.00000e+00, 0.00000e+00],\n [2.70000e+01, 0.00000e+00, 2.16481e+05, ..., 4.00000e+01,\n 0.00000e+00, 0.00000e+00],\n [2.50000e+01, 0.00000e+00, 2.56263e+05, ..., 4.00000e+01,\n 0.00000e+00, 0.00000e+00],\n ...,\n [4.50000e+01, 0.00000e+00, 2.07540e+05, ..., 4.00000e+01,\n 0.00000e+00, 1.00000e+00],\n [5.10000e+01, 
0.00000e+00, 1.80807e+05, ..., 4.00000e+01,\n 0.00000e+00, 0.00000e+00],\n [6.10000e+01, 4.00000e+00, 1.86451e+05, ..., 4.00000e+01,\n 0.00000e+00, 1.00000e+00]], dtype=float32)\n>>> categorical_columns\n[1, 5, 6, 7, 8, 9, 13, 14]\n>>> ordinal_columns\n[3]\n```\n\n## Demo Synthesizers\n\nIn order to get started using the benchmarking tool, some demo Synthesizers have been included\nin the library.\n\nThese synthesizers are classes that can be imported from the `sdgym.synthesizers` module and have\nthe following methods:\n\n* `fit`: Fits the synthesizer on the data. Expects the following arguments:\n * `data (numpy.ndarray)`: 2 dimensional Numpy matrix with the real data to learn from.\n * `categorical_columns (list or tuple)`: List of indexes of the columns that are categorical within the dataset.\n * `ordinal_columns (list or tuple)`: List of indexes of the columns that are ordinal within the dataset.\n* `sample`: Generates new data resembling the original dataset. Expects the following arguments:\n * `n_samples (int)`: Number of samples to generate.\n* `fit_sample`: Fits the synthesizer on the dataset and then samples as many rows as there were in\n the original dataset. 
It expects the same arguments as the `fit` method, and is ready to be directly\n passed to the `benchmark` function in order to evaluate the synthesizer performance.\n\nA complete example of how to use them can be found below in the [Usage](#usage) section.\n\n# Install\n\n## Requirements\n\n**SDGym** has been developed and tested on [Python 3.6 and 3.7](https://www.python.org/downloads/).\n\nAlso, although it is not strictly required, using a [virtualenv](https://virtualenv.pypa.io/en/latest/)\nis highly recommended in order to avoid interfering with other software installed in the system\nwhere **SDGym** is run.\n\nThese are the minimum commands needed to create a virtualenv using python3.6 for **SDGym**:\n\n```bash\npip install virtualenv\nvirtualenv -p $(which python3.6) sdgym-venv\n```\n\nAfterwards, execute this command to activate the virtualenv:\n\n```bash\nsource sdgym-venv/bin/activate\n```\n\nRemember to execute it every time you start a new console to work on **SDGym**!\n\n## Install with pip\n\nAfter creating the virtualenv and activating it, we recommend using\n[pip](https://pip.pypa.io/en/stable/) in order to install **SDGym**:\n\n```bash\npip install sdgym\n```\n\nThis will pull and install the latest stable release from [PyPI](https://pypi.org/).\n\n## Install from source\n\nAlternatively, with your virtualenv activated, you can clone the repository and install it from\nsource by running `make install` on the `stable` branch:\n\n```bash\ngit clone git@github.com:DAI-Lab/SDGym.git\ncd SDGym\ngit checkout stable\nmake install\n```\n\n## Install for Development\n\nIf you want to contribute to the project, a few more steps are required to make the project ready\nfor development.\n\nFirst, please head to [the GitHub page of the project](https://github.com/DAI-Lab/SDGym)\nand make a fork of the project under your own username by clicking on the **fork** button in the\nupper right corner of the page.\n\nAfterwards, clone 
your fork and create a branch from master with a descriptive name that includes\nthe number of the issue that you are going to work on:\n\n```bash\ngit clone git@github.com:{your username}/SDGym.git\ncd SDGym\ngit branch issue-xx-cool-new-feature master\ngit checkout issue-xx-cool-new-feature\n```\n\nFinally, install the project with the following command, which will install some additional\ndependencies for code linting and testing.\n\n```bash\nmake install-develop\n```\n\nMake sure to use them regularly while developing by running the commands `make lint` and\n`make test`.\n\n## Compile C++ dependencies\n\nIn order to be able to use all the features from SDGym, some dependencies written in C++ need to\nbe compiled.\n\nIn order to do this:\n\n1. Make sure that all the dependencies needed to compile C++ are installed. In Linux distributions\nbased on Ubuntu, this can be done with the following command:\n\n```bash\nsudo apt-get install build-essential\n```\n\n2. Execute:\n\n```bash\nmake compile\n```\n\n# Usage\n\n## Benchmark\n\nTo use the SDGym benchmark, all you need to do is import and call the `sdgym.benchmark`\nfunction, passing it your synthesizer function:\n\n```python\nfrom sdgym import benchmark\n\nscores = benchmark(my_synthesizer_function)\n```\n\nThe output of the `benchmark` function will be a `pd.DataFrame` containing all the scores\ncomputed by the different evaluators:\n\n```\n accuracy f1 name distance dataset iter\n0 0.7985 0.658301 DecisionTreeClassifier(class_weight='balanced'... 0.0 adult 0\n1 0.8607 0.680285 AdaBoostClassifier(algorithm='SAMME.R', base_e... 0.0 adult 0\n2 0.7948 0.660040 LogisticRegression(C=1.0, class_weight='balanc... 0.0 adult 0\n3 0.8489 0.678716 MLPClassifier(activation='relu', alpha=0.0001,... 0.0 adult 0\n0 0.7968 0.655943 DecisionTreeClassifier(class_weight='balanced'... 0.0 adult 1\n1 0.8607 0.680285 AdaBoostClassifier(algorithm='SAMME.R', base_e... 
0.0 adult 1\n2 0.7948 0.660040 LogisticRegression(C=1.0, class_weight='balanc... 0.0 adult 1\n3 0.8472 0.683775 MLPClassifier(activation='relu', alpha=0.0001,... 0.0 adult 1\n0 0.7963 0.655272 DecisionTreeClassifier(class_weight='balanced'... 0.0 adult 2\n1 0.8607 0.680285 AdaBoostClassifier(algorithm='SAMME.R', base_e... 0.0 adult 2\n2 0.7948 0.660040 LogisticRegression(C=1.0, class_weight='balanc... 0.0 adult 2\n3 0.8511 0.684467 MLPClassifier(activation='relu', alpha=0.0001,... 0.0 adult 2\n```\n\n## Using the Demo Synthesizers\n\nIn order to use the synthesizer classes included in **SDGym**, you need to follow these steps:\n\n1. Import the synthesizer class from `sdgym.synthesizers`:\n\n```python\nfrom sdgym.synthesizers import IndependentSynthesizer\n```\n\n2. Create an instance of the synthesizer, passing any needed arguments. In this case we will use\nthe `IndependentSynthesizer`, which can be instantiated with no initialization arguments:\n\n```python\nsynthesizer = IndependentSynthesizer()\n```\n\n3. Load some data to fit your synthesizer with. In this case, we will be loading the `adult`\ndataset:\n\n```python\nfrom sdgym import load_dataset\n\ndata, categorical_columns, ordinal_columns = load_dataset('adult')\n```\n\n4. Call its `fit` method, passing the data as well as the lists of categorical and ordinal columns:\n\n```python\nsynthesizer.fit(data, categorical_columns, ordinal_columns)\n```\n\n5. 
Call its `sample` method, passing the number of rows that we want to sample:\n\n```python\nsampled = synthesizer.sample(3)\n```\n\nThis will return a numpy matrix of sampled data with the same columns as the original data and\nas many rows as we have requested:\n\n```\narray([[5.1774925e+01, 0.0000000e+00, 5.3538445e+04, 6.0000000e+00,\n 8.9999313e+00, 2.0000000e+00, 1.0000000e+00, 3.0000000e+00,\n 2.0000000e+00, 1.0000000e+00, 3.7152294e-04, 1.9912617e-04,\n 1.0767025e+01, 0.0000000e+00, 0.0000000e+00],\n [6.4843109e+01, 0.0000000e+00, 2.6462553e+05, 1.2000000e+01,\n 8.9993210e+00, 1.0000000e+00, 0.0000000e+00, 1.0000000e+00,\n 0.0000000e+00, 0.0000000e+00, 5.3685449e-06, 1.9797031e-03,\n 2.2253288e+01, 0.0000000e+00, 0.0000000e+00],\n [6.5659584e+01, 5.0000000e+00, 3.6158912e+05, 8.0000000e+00,\n 9.0010223e+00, 0.0000000e+00, 1.2000000e+01, 3.0000000e+00,\n 0.0000000e+00, 0.0000000e+00, 1.0562389e-03, 0.0000000e+00,\n 3.9998917e+01, 0.0000000e+00, 0.0000000e+00]], dtype=float32)\n```\n\n## Benchmarking the Demo Synthesizers\n\nEvaluating the performance of any of the Demo synthesizers is as simple as:\n\n1. Creating an instance of the synthesizer:\n\n```python\nsynthesizer = IndependentSynthesizer()\n```\n\n2. 
Passing the `fit_sample` method of the instance to the `benchmark` function as your\nsynthesizer function:\n\n```python\nbenchmark(synthesizer.fit_sample)\n```\n\n# What's next?\n\nFor more details about **SDGym** and all its possibilities and features, please check the\n[documentation site](https://DAI-Lab.github.io/SDGym/).\n\nThere you can learn more about\n[how to contribute to SDGym](https://HDI-Project.github.io/SDGym/community/contributing.html)\nin order to help us develop new features or cool ideas.\n\n# Related Projects\n\n## SDV\n\n[SDV](https://github.com/HDI-Project/SDV), for Synthetic Data Vault, is the end-user library for\nsynthesizing data in development under the [HDI Project](https://hdi-dai.lids.mit.edu/).\nSDV allows you to easily model and sample relational datasets using Copulas through a simple API.\nOther features include anonymization of Personally Identifiable Information (PII) and preserving\nrelational integrity on sampled records.\n\n## TGAN\n\n[TGAN](https://github.com/DAI-Lab/TGAN) is a GAN-based model for synthesizing tabular data.\nIt is also developed by [MIT's Data to AI Lab](https://dai-lab.github.io/) and is under\nactive development.\n\n\n# History\n\n## v0.1.0 - 2019-08-07\n\nFirst release to PyPI\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/DAI-Lab/SDGym", "keywords": "machine learning synthetic data benchmark generative models", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "sdgym", "package_url": "https://pypi.org/project/sdgym/", "platform": "", "project_url": "https://pypi.org/project/sdgym/", "project_urls": { "Homepage": "https://github.com/DAI-Lab/SDGym" }, "release_url": "https://pypi.org/project/sdgym/0.1.0/", "requires_dist": [ "matplotlib (<4,>=3.1.0)", "numpy (<1.18,>=1.16.3)", "pandas (<0.26,>=0.24.2)", "pomegranate (<0.12,>=0.11.0)", 
"scikit-learn (<0.22,>=0.21.1)", "scipy (<2,>=1.3.0)", "torch (<2,>=1.1.0)", "torchvision (>=0.3.0)", "bumpversion (>=0.5.3) ; extra == 'dev'", "pip (>=9.0.1) ; extra == 'dev'", "watchdog (>=0.8.3) ; extra == 'dev'", "m2r (>=0.2.0) ; extra == 'dev'", "Sphinx (>=1.7.1) ; extra == 'dev'", "sphinx-rtd-theme (>=0.2.4) ; extra == 'dev'", "autodocsumm (>=0.1.10) ; extra == 'dev'", "flake8 (>=3.7.7) ; extra == 'dev'", "isort (>=4.3.4) ; extra == 'dev'", "autoflake (>=1.1) ; extra == 'dev'", "autopep8 (>=1.4.3) ; extra == 'dev'", "twine (>=1.10.0) ; extra == 'dev'", "wheel (>=0.30.0) ; extra == 'dev'", "coverage (>=4.5.1) ; extra == 'dev'", "tox (>=2.9.1) ; extra == 'dev'", "pytest (>=3.4.2) ; extra == 'dev'", "pytest-cov (>=2.6.0) ; extra == 'dev'", "pytest (>=3.4.2) ; extra == 'test'", "pytest-cov (>=2.6.0) ; extra == 'test'" ], "requires_python": ">=3.6", "summary": "A framework to benchmark the performance of synthetic data generators for non-temporal tabular data", "version": "0.1.0" }, "last_serial": 5648865, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "4efda65398d4d98c00368eff335e0b0d", "sha256": "f147fd8ecbb80d36c9822a01f0f5be101b252c2a5d58b18193e06fecd745752a" }, "downloads": -1, "filename": "sdgym-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4efda65398d4d98c00368eff335e0b0d", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 30713, "upload_time": "2019-08-08T08:52:44", "url": "https://files.pythonhosted.org/packages/b4/c1/cd8eb52e56672b1d898ebb5467dc0c887e364512032bf8393a1f48d14b8f/sdgym-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6eb90837022ed90cfa57015aea5eabcd", "sha256": "60f25bd1b04e74f2262d5ec0835edd37abfee1464d5380f3f9abaa77b9fcfdf0" }, "downloads": -1, "filename": "sdgym-0.1.0.tar.gz", "has_sig": false, "md5_digest": "6eb90837022ed90cfa57015aea5eabcd", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 
1415628, "upload_time": "2019-08-08T08:53:14", "url": "https://files.pythonhosted.org/packages/3b/29/d51ee48a1964f62cfa86f955bf108e6f41402185832fe804c803a5689391/sdgym-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4efda65398d4d98c00368eff335e0b0d", "sha256": "f147fd8ecbb80d36c9822a01f0f5be101b252c2a5d58b18193e06fecd745752a" }, "downloads": -1, "filename": "sdgym-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4efda65398d4d98c00368eff335e0b0d", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 30713, "upload_time": "2019-08-08T08:52:44", "url": "https://files.pythonhosted.org/packages/b4/c1/cd8eb52e56672b1d898ebb5467dc0c887e364512032bf8393a1f48d14b8f/sdgym-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6eb90837022ed90cfa57015aea5eabcd", "sha256": "60f25bd1b04e74f2262d5ec0835edd37abfee1464d5380f3f9abaa77b9fcfdf0" }, "downloads": -1, "filename": "sdgym-0.1.0.tar.gz", "has_sig": false, "md5_digest": "6eb90837022ed90cfa57015aea5eabcd", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1415628, "upload_time": "2019-08-08T08:53:14", "url": "https://files.pythonhosted.org/packages/3b/29/d51ee48a1964f62cfa86f955bf108e6f41402185832fe804c803a5689391/sdgym-0.1.0.tar.gz" } ] }