{ "info": { "author": "Bauke Brenninkmeijer", "author_email": "bauke.brenninkmeijer@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# TableEvaluator\n[![PyPI version](https://badge.fury.io/py/table-evaluator.svg)](https://badge.fury.io/py/table-evaluator)\n\nTableEvaluator is a library to evaluate how similar a synthesized dataset is to real data. In other words, it tries to give an indication of how real your fake data is. With the rise of GANs designed specifically for tabular data, many applications are becoming possible. For industries like finance, healthcare and government, having the capacity to create high quality synthetic data that does **not** have the privacy constraints of normal data is extremely valuable. Since this field is still quite young and developing, I created this library to have a consistent evaluation method for your models.\n\n## Installation\nThe package can be installed with\n```\npip install table_evaluator\n```\n\n## Usage\nStart by importing the class\n```Python\nfrom table_evaluator import *\n```\n\nThe package is used by having two DataFrames; one with the real data and one with the synthetic data. These are passed to the TableEvaluator on init.\nThe `helpers.load_data` function is a convenient way to load these dataframes from disk, since it converts them to the same dtypes and columns after loading. 
However, any dataframe will do as long as they have the same columns and data types.\n\n Using the test data available in the `data` directory, we do:\n\n```python\nreal, fake = load_data('data/real_test_sample.csv', 'data/fake_test_sample.csv')\n\n```\nwhich gives us two dataframes and specifies which columns should be treated as categorical columns.\n\n```python\nreal.head()\n```\n\n\n| trans_id | account_id | trans_amount | balance_after_trans | trans_type | trans_operation | trans_k_symbol | trans_date |\n|----------|------------|--------------|---------------------|------------|----------------------------|-------------------|------------|\n| 951892 | 3245 | 3878.0 | 13680.0 | WITHDRAWAL | REMITTANCE_TO_OTHER_BANK | HOUSEHOLD | 2165 |\n| 3547680 | 515 | 65.9 | 14898.6 | CREDIT | UNKNOWN | INTEREST_CREDITED | 2006 |\n| 1187131 | 4066 | 32245.0 | 57995.5 | CREDIT | COLLECTION_FROM_OTHER_BANK | UNKNOWN | 2139 |\n| 531421 | 1811 | 3990.8 | 23324.9 | WITHDRAWAL | REMITTANCE_TO_OTHER_BANK | LOAN_PAYMENT | 892 |\n| 37081 | 119 | 12100.0 | 36580.0 | WITHDRAWAL | WITHDRAWAL_IN_CASH | UNKNOWN | 654 |\n\n\n```python\nfake.head()\n```\n\n| trans_id | account_id | trans_amount | balance_after_trans | trans_type | trans_operation | trans_k_symbol | trans_date |\n|----------|------------|--------------|---------------------|------------|----------------------------|----------------|------------|\n| 911598 | 3001 | 13619.0 | 92079.0 | CREDIT | COLLECTION_FROM_OTHER_BANK | UNKNOWN | 1885 |\n| 377371 | 1042 | 4174.0 | 32470.0 | WITHDRAWAL | REMITTANCE_TO_OTHER_BANK | HOUSEHOLD | 1483 |\n| 970113 | 3225 | 274.0 | 57608.0 | WITHDRAWAL | WITHDRAWAL_IN_CASH | UNKNOWN | 1855 |\n| 450090 | 1489 | 301.0 | 36258.0 | CREDIT | CREDIT_IN_CASH | UNKNOWN | 885 |\n| 1120409 | 3634 | 6303.0 | 50975.0 | WITHDRAWAL | REMITTANCE_TO_OTHER_BANK | HOUSEHOLD | 1211 |\n\n\n```Python\ncat_cols = ['trans_type', 'trans_operation', 'trans_k_symbol']\n```\n\nIf we do not specify categorical columns when 
initializing the TableEvaluator, it will treat all columns with more than 50 unique values as continuous and anything with fewer as categorical.\n\nThen we create the TableEvaluator object:\n```Python\ntable_evaluator = TableEvaluator(real, fake, cat_cols=cat_cols)\n```\n\nIt's nice to start with some plots to get a feel for the data and how they correlate. The test samples contain only 1000 rows, which is why the cumulative sum plots are not very smooth.\n\n```python\ntable_evaluator.visual_evaluation()\n```\n\n\n![png](images/output_7_0.png)\n\n\n\n![png](images/output_7_1.png)\n\n\n\n![png](images/output_7_2.png)\n\n\n\n![png](images/output_7_3.png)\n\n\nThe `evaluate` method gives us the most complete idea of how close the data sets are to each other.\n\n```python\ntable_evaluator.evaluate(target_col='trans_type')\n```\n\n\n Correlation metric: pearsonr\n\n Classifier F1-scores:\n real fake\n real_data_LogisticRegression_F1 0.8200 0.8150\n real_data_RandomForestClassifier_F1 0.9800 0.9800\n real_data_DecisionTreeClassifier_F1 0.9600 0.9700\n real_data_MLPClassifier_F1 0.3500 0.6850\n fake_data_LogisticRegression_F1 0.7800 0.7650\n fake_data_RandomForestClassifier_F1 0.9300 0.9300\n fake_data_DecisionTreeClassifier_F1 0.9300 0.9400\n fake_data_MLPClassifier_F1 0.3600 0.6200\n\n Miscellaneous results:\n Result\n Column Correlation Distance RMSE 0.0399\n Column Correlation distance MAE 0.0296\n Duplicate rows between sets (real/fake) (0, 0)\n nearest neighbor mean 0.5655\n nearest neighbor std 0.3726\n\n Results:\n Result\n basic statistics 0.9940\n Correlation column correlations 0.9904\n Mean Correlation between fake and real columns 0.9566\n 1 - MAPE Estimator results 0.7843\n 1 - MAPE 5 PCA components 0.9138\n Similarity Score 0.9278\n\n The similarity score is an aggregate metric of the five other metrics in the section with results. 
Additionally, the F1/RMSE scores are printed since they give valuable insights into the strengths and weaknesses of some of these models. Lastly, some miscellaneous results are printed, like the nearest neighbor distance between each row in the fake dataset and the closest row in the real dataset. This provides insight into the privacy retention capability of the model. Note that the mean and standard deviation of nearest neighbor is limited to 20k rows, due to time and hardware limitations.\n\n\nOther relevant methods are:\n```python\ntable_evaluator.statistical_evaluation()\ntable_evaluator.correlation_correlation()\ntable_evaluator.pca_correlation()\ntable_evaluator.estimator_evaluation()\ntable_evaluator.row_distance()\ntable_evaluator.get_duplicates()\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Baukebrenninkmeijer/Table-Evaluator", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "table-evaluator", "package_url": "https://pypi.org/project/table-evaluator/", "platform": "", "project_url": "https://pypi.org/project/table-evaluator/", "project_urls": { "Homepage": "https://github.com/Baukebrenninkmeijer/Table-Evaluator" }, "release_url": "https://pypi.org/project/table-evaluator/1.0.5/", "requires_dist": null, "requires_python": "", "summary": "A package to evaluate how close a synthetic data set is to real data.", "version": "1.0.5" }, "last_serial": 5938190, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "d63b7c64df2f6f1ba63ce760d3e112ea", "sha256": "04b15cce6d0ae035ff251531402bf3f61a04abce8ca5410c72fb86d4c110f283" }, "downloads": -1, "filename": "table_evaluator-1.0.0.tar.gz", "has_sig": false, "md5_digest": "d63b7c64df2f6f1ba63ce760d3e112ea", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17995, "upload_time": 
"2019-09-02T13:50:36", "url": "https://files.pythonhosted.org/packages/60/3d/74c641787d28dcb5edbf0934b0911401974beb09f9de32c3a848c8f892dc/table_evaluator-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "c4b5276b5f76de8698a7fd3c03f0c3a8", "sha256": "43dc6414b0f3c4b19e1d840d72d198c52c5147e1ee6892ed4b676d6444d38f93" }, "downloads": -1, "filename": "table_evaluator-1.0.1.tar.gz", "has_sig": false, "md5_digest": "c4b5276b5f76de8698a7fd3c03f0c3a8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17813, "upload_time": "2019-09-02T14:38:37", "url": "https://files.pythonhosted.org/packages/9c/74/8b65ee03adb81dbc358efed380fde43d912e2e09ccf1fe9ab58a4241d691/table_evaluator-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "d272976b9ad2c6abf2a614fa0406014f", "sha256": "244340325717ca8c1b775b83eb270a2de2f06428c36ae2076c79350b23b017ca" }, "downloads": -1, "filename": "table_evaluator-1.0.2.tar.gz", "has_sig": false, "md5_digest": "d272976b9ad2c6abf2a614fa0406014f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17267, "upload_time": "2019-09-02T15:31:10", "url": "https://files.pythonhosted.org/packages/78/9e/d96405f70514a8f866ae75965053e17a83987eabbe677acdc5780e476904/table_evaluator-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "8b448094fa8ebe5ec5937647e722e2bd", "sha256": "9c8e369d42f7b66e18d4e494e69ce04a1bcc946f935022a16e2cddbf1ff1054b" }, "downloads": -1, "filename": "table-evaluator-1.0.3.tar.gz", "has_sig": false, "md5_digest": "8b448094fa8ebe5ec5937647e722e2bd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17269, "upload_time": "2019-09-05T14:05:11", "url": "https://files.pythonhosted.org/packages/05/77/a7d4330dd0ce8e1e4a75a7cb8eb96b519b252c2452d74538b86ab2635928/table-evaluator-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "30a4187e5a6d408fcea3a288164ea139", 
"sha256": "0192a4bd224e96519afda31acfc990c0a69605fd684e77dd822eec01f65be38b" }, "downloads": -1, "filename": "table-evaluator-1.0.4.tar.gz", "has_sig": false, "md5_digest": "30a4187e5a6d408fcea3a288164ea139", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17495, "upload_time": "2019-10-07T11:07:58", "url": "https://files.pythonhosted.org/packages/c5/7d/17ce15dcecb547e06d6ba96797d6815b996337468eaa289f83256b1894ee/table-evaluator-1.0.4.tar.gz" } ], "1.0.5": [ { "comment_text": "", "digests": { "md5": "b72b60990455db9b128d3888f1e8fa8b", "sha256": "1af2c41b82426af4e5392063da599fbc19ed52668c529bd1b25370c0c722fe81" }, "downloads": -1, "filename": "table-evaluator-1.0.5.tar.gz", "has_sig": false, "md5_digest": "b72b60990455db9b128d3888f1e8fa8b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17495, "upload_time": "2019-10-07T11:14:48", "url": "https://files.pythonhosted.org/packages/2f/73/34228b116e958c519fb228ee2ab1a5f77d9f30e5299c0e270187c949afc3/table-evaluator-1.0.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b72b60990455db9b128d3888f1e8fa8b", "sha256": "1af2c41b82426af4e5392063da599fbc19ed52668c529bd1b25370c0c722fe81" }, "downloads": -1, "filename": "table-evaluator-1.0.5.tar.gz", "has_sig": false, "md5_digest": "b72b60990455db9b128d3888f1e8fa8b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17495, "upload_time": "2019-10-07T11:14:48", "url": "https://files.pythonhosted.org/packages/2f/73/34228b116e958c519fb228ee2ab1a5f77d9f30e5299c0e270187c949afc3/table-evaluator-1.0.5.tar.gz" } ] }