{ "info": { "author": "Kevin Crouse", "author_email": "krcrouse@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3.7" ], "description": "[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n# pydatacleaner\n\npydatacleaner is a Python library to standardize data and to flexibly render them into the most correct datatypes. It was originally a significant part of the matrixb project to handle errors and inconsistencies when reading in data files.\n\nThe core DataCleaner class is a general base class and has functions primarily designed to look at text values and convert them to numbers, dates, booleans, etc, as appropriate. The *convert_{datatype}* properties indicate which types should be investigated, which triggers the *parse_{datatype}* as eligible.\n\nSubclasses of datacleaner include SnakeCase and CamelCase, and as suggested, will convert strings to their snake or camel versions, and can translate between each other via the *tokenize()* and *join()* functions, which are abstract in the DataCleaner baseclass.\n\n## Distribution\n\n* [PyPI Distribution Page](https://pypi.org/project/pydatacleaner)\n* [GitLab Project](https://gitlab.com/krcrouse/datacleaner)\n* [Read the Docs Full API Documentation](https://datacleaner.readthedocs.io)\n\n## Project Status\n\nCurrently, pydatacleaner is functional but shallowly vetted condition and should be considered **alpha** software. Your mileage may vary.\n\nCode comments of *NOTE* and *TODO* indicate known shortcomings that may be useful to you. The interface will likely change in future versions.\n\nIf you wish to rely on features of this package, I am likely more than willing to accommodate and to incorporate sensible design improvements or, in some cases, changes.\n\n### Limitations and Future Directions\n\nThe clean function is relatively slow, especially when considering all data translations for large datasets. I will be focusing on performance improvements by writing many of the core data processing in C instead of using the pure-python version.\n\n## Installation\n\nUse the package manager [pip](https://pip.pypa.io/en/stable/) to install pydatacleaner.\n\n```bash\npip install pydatacleaner\n```\n\n## Usage\n\nMany examples of usage are available in the main test files included in the t/ subdirectory.\n\n```python\nimport datacleaner\n\n#\n# Auto-translating various null values to None\n#\n\ndefault_null_values = ['NA', 'na', 'N/A', 'n/a', 'NULL', 'null', 'None', 'none', 'nan', 'NaN', '#N/A']\n\ncleaner = datacleaner.DataCleaner(\n null_values = default_null_values,\n)\n\nfor ok in ('a','123', 123, 15.0, 2.66, -3.55, datetime.date(2017,3,5)):\n val = cleaner.clean(ok)\n assert ok is not None\n\nfor nullable in default_null_values:\n assert nullable is not None\n assert cleaner.clean(nullable) is None\n\nadditional = ['Toothpaste', -999, 0, 15.2, 'Woof', 'foo bar']\nfor ok in additional:\n val = cleaner.clean(ok)\n assert ok is not None\n\ncleaner.add_null_values(additional)\nfor nullable in additional:\n assert nullable is not None\n assert cleaner.clean(nullable) is None\n\nassert cleaner.clean(None) is None\n\n#\n# Examples of using translations\n#\n\ninitial_translations = {\n -999:None,\n 888:'Missing',\n '-999': 'N/A',\n '888': 5,\n None: 0,\n '777': 'INVALID string num not converting to num first',\n 777: 'num convert is accurate',\n 'xxx': 0,\n}\n\ncleaner = datacleaner.DataCleaner(\n translations = initial_translations\n)\n# this verifies the expected cases that types are converted before translations,\n# so -999 and '-999' should relate to the same output (though we also dont trust order for hashes)\nassert cleaner.clean('777') == 'num convert is accurate'\nassert cleaner.clean(-999) == None\nassert cleaner.clean('-999') == None\nassert cleaner.clean(888) == 'Missing'\nassert cleaner.clean('888') == 'Missing'\nassert cleaner.clean(None) == 0\nassert cleaner.clean('xxx') == 0\n\n#\n# test that adding translations after initialization works\nassert cleaner.clean(999) == 999\ncleaner.add_translations({999:None})\nassert cleaner.clean(999) == None\n\nbasic_transliterations = [\n ('a','X'),\n [' ','_']\n]\ncleaner = datacleaner.DataCleaner(\n transliterations= basic_transliterations,\n)\n\nassert cleaner.clean('aaa') == 'XXX'\nassert cleaner.clean('woof and bark') == 'woof_Xnd_bXrk'\n\n#\n# Time processing, requiring that the string ends up as a time object.\n# Setting data_type means that the values are required to be None or time,\n# and an error is thrown otherwise. \n#\ncleaner = datacleaner.DataCleaner( data_type=datetime.time )\n\ntimes = {\n '15:00': datetime.time(15,0),\n '2:00 pm': datetime.time(14,0),\n '2:00': datetime.time(2,0),\n '2:00PM': datetime.time(14,0),\n '0:00': datetime.time(0,0),\n '12:00': datetime.time(12,0),\n '12:00pm': datetime.time(12,0),\n '12:00 a.m.': datetime.time(0,0),\n '18:55:30.35': datetime.time(18,55,30, 350000),\n '18:55:30': datetime.time(18,55,30),\n '6:55:30 pm': datetime.time(18,55,30),\n}\n\nfor raw, test in times.items():\n assert cleaner.clean(raw) == test\n\n\n#\n# Dates\n#\n\ndates = {\n '03-Aug-06': datetime.date(2006,8,3),\n '03-Aug-2006': datetime.date(2006,8,3),\n '3-Aug-06': datetime.date(2006,8,3),\n '3-Aug-2006': datetime.date(2006,8,3),\n '3-August-06': datetime.date(2006,8,3),\n '3-August-2006': datetime.date(2006,8,3),\n 'Aug-03-06': datetime.date(2006,8,3),\n 'Aug-03-2006': datetime.date(2006,8,3),\n 'Monday, 3 of August 2006': datetime.date(2006,8,3),\n '03Aug06': datetime.date(2006,8,3),\n '03Aug2006': datetime.date(2006,8,3),\n '2006-08-03': datetime.date(2006,8,3),\n '20060803': datetime.date(2006,8,3),\n 20060803: datetime.date(2006,8,3),\n '2006/08/03': datetime.date(2006,8,3),\n # month - day - year\n '08/03/06': datetime.date(2006,8,3),\n '08/03/2006': datetime.date(2006,8,3),\n '8/3/06': datetime.date(2006,8,3),\n '8/3/2006': datetime.date(2006,8,3),\n '12.18.97': datetime.date(1997,12,18),\n '12.25.2006': datetime.date(2006,12,25),\n '7-5-2000': datetime.date(2000,7,5),\n}\n\ncleaner.data_type = datetime.date\nfor raw, test in dates.items():\n assert cleaner.clean(raw) == test\n\ndatetimes = {\n '1994-11-05T08:15:30-05:00': datetime.datetime(\n 1994, 11, 5, 8, 15, 30,\n tzinfo= datetime.timezone(datetime.timedelta(hours=-5))),\n '1994-11-05T13:15:30Z':datetime.datetime(\n 1994, 11, 5, 13, 15, 30,\n tzinfo= datetime.timezone(datetime.timedelta(hours=0))),\n '1994-11-05T08:03:30-05:00': datetime.datetime(\n 1994, 11, 5, 8, 3, 30,\n tzinfo= datetime.timezone(datetime.timedelta(hours=-5))),\n '1994-11-05T13:03:30Z':datetime.datetime(\n 1994, 11, 5, 13, 3, 30,\n tzinfo= datetime.timezone(datetime.timedelta(hours=0))),\n '2006-08-03 18:55:30': datetime.datetime(2006,8,3,18,55,30),\n '03-Aug-2006 6:55:30 pm': datetime.datetime(2006,8,3,18,55,30),\n}\n\ncleaner.data_type = datetime.datetime\nfor raw, test in dates.items():\n assert cleaner.clean(raw) == test\n\n\n```\n\n## Contributing\nContributions are collaboration is welcome. For major changes, please contact me in advance to discuss.\n\nPlease make sure to update tests for any contribution, as appropriate.\n\n## Author\n\n[Kevin Crouse](mailto:krcrouse@gmail.com). Copyright, 2019.\n\n## License\n[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://gitlab.com/krcrouse/datacleaner", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "pydatacleaner", "package_url": "https://pypi.org/project/pydatacleaner/", "platform": "", "project_url": "https://pypi.org/project/pydatacleaner/", "project_urls": { "Homepage": "https://gitlab.com/krcrouse/datacleaner" }, "release_url": "https://pypi.org/project/pydatacleaner/0.1.2.5/", "requires_dist": null, "requires_python": "", "summary": "A small utility designed to process/parse/clean scalars, especially text. (note: in active development)", "version": "0.1.2.5" }, "last_serial": 5936626, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "8340e8f184b094b00c0860446f6d3064", "sha256": "93e7371f6780e23726028557e9860ef597933836d8254eebbda7ab1e9317ef7b" }, "downloads": -1, "filename": "pydatacleaner-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "8340e8f184b094b00c0860446f6d3064", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 15968, "upload_time": "2019-09-15T15:19:58", "url": "https://files.pythonhosted.org/packages/3c/19/6664925a091a1f7cfa3b845ab615b4e4b5bb8b206ff2fab38b5597d91239/pydatacleaner-0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "22c7946fbfcf856b34c245b1b1688d31", "sha256": "582436f5905cea430a1efe7fe5f8614d8db6252b6d6da54dccf631f8316fba63" }, "downloads": -1, "filename": "pydatacleaner-0.1.tar.gz", "has_sig": false, "md5_digest": "22c7946fbfcf856b34c245b1b1688d31", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13749, "upload_time": "2019-09-15T15:20:00", "url": "https://files.pythonhosted.org/packages/7e/c5/ce8dcc54a062d1efe9d4d8c1a28e66e1b95e4f891d9bf0d274827d026076/pydatacleaner-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "31e7af4fb8ad259e4255d6a0d7afd0eb", "sha256": "eb8e62f5528ed1dd704646467596da422dc927caeaae682f8fe8d1fdf7a0ecc4" }, "downloads": -1, "filename": "pydatacleaner-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "31e7af4fb8ad259e4255d6a0d7afd0eb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 15999, "upload_time": "2019-09-15T15:25:16", "url": "https://files.pythonhosted.org/packages/e8/e3/6ae4e94a040e043e2d15e840e0d14e9198adbc485bc6b0d48d4c8e86dd1b/pydatacleaner-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "8e06e05e9cb64ee43a33e55886713189", "sha256": "21c3bad92d81dc50385274d1135c83c6e907a0c0d6cacdf6d99997876b3240f2" }, "downloads": -1, "filename": "pydatacleaner-0.1.1.tar.gz", "has_sig": false, "md5_digest": "8e06e05e9cb64ee43a33e55886713189", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13771, "upload_time": "2019-09-15T15:25:18", "url": "https://files.pythonhosted.org/packages/fd/4f/1f23ec79775a110a0ebf25e32c75aaf86f5829dc8fbac87c38e832314525/pydatacleaner-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "0e23fe7557870418fa540748c96f1c48", "sha256": "cd79c50405589c4d3f2c87ea096df85aa32d0a2340e85bdf3a39d45306b38e98" }, "downloads": -1, "filename": "pydatacleaner-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "0e23fe7557870418fa540748c96f1c48", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16000, "upload_time": "2019-09-21T13:23:27", "url": "https://files.pythonhosted.org/packages/85/3b/0d07bfb10924f3a607d4bf3a767ecd9f804af1ffffdcc9558f446672681c/pydatacleaner-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "625f81ff9eb1a9def054b35e886c8b97", "sha256": "695e1e11fea9591e2d46f7961a7cae3c2b688e27881506f145c804dce6cb3dd7" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.tar.gz", "has_sig": false, "md5_digest": "625f81ff9eb1a9def054b35e886c8b97", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13779, "upload_time": "2019-09-21T13:23:29", "url": "https://files.pythonhosted.org/packages/0f/b8/350cbd29e26e1ac39d353e3279f0c6f84d400f57182022bbc03a78dc0120/pydatacleaner-0.1.2.tar.gz" } ], "0.1.2.2": [ { "comment_text": "", "digests": { "md5": "43777fef1b6136971f54c138269fa4f5", "sha256": "efd70402dbcc9e0bcb0830325b90261a8506b19ffae78c2e0be0b3ef3bc161ca" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.2-py3-none-any.whl", "has_sig": false, "md5_digest": "43777fef1b6136971f54c138269fa4f5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16118, "upload_time": "2019-09-21T13:27:51", "url": "https://files.pythonhosted.org/packages/00/29/ba81c14024fea63592837c2bf4664e8f3c1dc6f1b41339961ec70a8e9154/pydatacleaner-0.1.2.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5083bd628f4c99153d14927fa993d245", "sha256": "584502936b196f9dff6c9b2a02374ca96ded819f37f66315739206cc55b15c81" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.2.tar.gz", "has_sig": false, "md5_digest": "5083bd628f4c99153d14927fa993d245", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13979, "upload_time": "2019-09-21T13:27:52", "url": "https://files.pythonhosted.org/packages/4a/8d/2f16f5c75364b0662fe390b49b95b28b1776efeda28f057eb90b86db842f/pydatacleaner-0.1.2.2.tar.gz" } ], "0.1.2.3": [ { "comment_text": "", "digests": { "md5": "ce04ff928acd056f14a244238f524dbc", "sha256": "fc60b9c834f638432b1f931a5fc9fdc7f5d852758df73b641f859934001927a1" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.3-py3-none-any.whl", "has_sig": false, "md5_digest": "ce04ff928acd056f14a244238f524dbc", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16119, "upload_time": "2019-09-21T13:30:16", "url": "https://files.pythonhosted.org/packages/ec/6e/3017ec941b167c2881b11f2fb8277d4c04d886037d043c15c27152d3d0be/pydatacleaner-0.1.2.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "af25988bebbee7d38b14898de9f3408f", "sha256": "bbf51fd17fe4627daf5e608b6c1d135694b74c25de7d2af48ed68db66cad7742" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.3.tar.gz", "has_sig": false, "md5_digest": "af25988bebbee7d38b14898de9f3408f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13977, "upload_time": "2019-09-21T13:30:17", "url": "https://files.pythonhosted.org/packages/b1/3e/62ea1f2d95e88f9f5cfe4144d63d43b7a4af91f4aef46266ca1088855d79/pydatacleaner-0.1.2.3.tar.gz" } ], "0.1.2.5": [ { "comment_text": "", "digests": { "md5": "e0af422bc3be5b0ec92db89cde27639d", "sha256": "82d951594dfe56cbe27bea8c245590b0c1fe7c981479c4c4e620ada5955b02b9" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.5-py3-none-any.whl", "has_sig": false, "md5_digest": "e0af422bc3be5b0ec92db89cde27639d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16628, "upload_time": "2019-10-07T02:12:19", "url": "https://files.pythonhosted.org/packages/89/ff/8b318eeca739ddffd7f2b865e39362b6f00f29d9443945f0e484fef37787/pydatacleaner-0.1.2.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6148deba2eab0154191502777df3ea74", "sha256": "c6f97e7d6f7c729c6dfd10f9a20cd26542b3cff5efb3192695de425a22619754" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.5.tar.gz", "has_sig": false, "md5_digest": "6148deba2eab0154191502777df3ea74", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14436, "upload_time": "2019-10-07T02:12:21", "url": "https://files.pythonhosted.org/packages/fe/3d/f6f0b6e002c6792f08f469cc993bdc3e74e87ee4c39e16001b447a3f32a1/pydatacleaner-0.1.2.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e0af422bc3be5b0ec92db89cde27639d", "sha256": "82d951594dfe56cbe27bea8c245590b0c1fe7c981479c4c4e620ada5955b02b9" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.5-py3-none-any.whl", "has_sig": false, "md5_digest": "e0af422bc3be5b0ec92db89cde27639d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16628, "upload_time": "2019-10-07T02:12:19", "url": "https://files.pythonhosted.org/packages/89/ff/8b318eeca739ddffd7f2b865e39362b6f00f29d9443945f0e484fef37787/pydatacleaner-0.1.2.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6148deba2eab0154191502777df3ea74", "sha256": "c6f97e7d6f7c729c6dfd10f9a20cd26542b3cff5efb3192695de425a22619754" }, "downloads": -1, "filename": "pydatacleaner-0.1.2.5.tar.gz", "has_sig": false, "md5_digest": "6148deba2eab0154191502777df3ea74", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14436, "upload_time": "2019-10-07T02:12:21", "url": "https://files.pythonhosted.org/packages/fe/3d/f6f0b6e002c6792f08f469cc993bdc3e74e87ee4c39e16001b447a3f32a1/pydatacleaner-0.1.2.5.tar.gz" } ] }