{ "info": { "author": "Johannes Filter", "author_email": "hi@jfilter.de", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# `clean-text` [![Build Status](https://travis-ci.com/jfilter/clean-text.svg?branch=master)](https://travis-ci.com/jfilter/clean-text) [![PyPI](https://img.shields.io/pypi/v/clean-text.svg)](https://pypi.org/project/clean-text/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/clean-text.svg)](https://pypi.org/project/clean-text/)\n\nClean your text with `clean-text` to create normalized text representations. For instance, turn this corrupted input:\n\n```txt\nA bunch of \\\\u2018new\\\\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).\n\n\n\u00bbY\u00f3\u00f9 \u00e0r\u00e9 r\u00efght <3!\u00ab\n```\n\ninto this clean output:\n\n```txt\nA bunch of 'new' references, including [moana]().\n\n\"you are right <3!\"\n```\n\n`clean-text` uses [ftfy](https://github.com/LuminosoInsight/python-ftfy), [unidecode](https://github.com/takluyver/Unidecode) and numerous hand-crafted rules, i.e., RegEx.\n\n## Installation\n\nTo install the GPL-licensed package [unidecode](https://github.com/takluyver/Unidecode) alongside:\n\n```bash\npip install clean-text[gpl]\n```\n\nYou may want to abstain from GPL:\n\n```bash\npip install clean-text\n```\n\nIf [unidecode](https://github.com/takluyver/Unidecode) is not available, `clean-text` will resort to Python's [unicodedata.normalize](https://docs.python.org/3.7/library/unicodedata.html#unicodedata.normalize) for [transliteration](https://en.wikipedia.org/wiki/Transliteration).\nTransliteration to the closest ASCII symbols involves manual mappings, e.g., `\u00ea` to `e`. 
Unidecode's hand-crafted mapping is superior, but unicodedata's is sufficient.\nHowever, you may want to disable this feature altogether depending on your data and use case.\n\n## Usage\n\n```python\nfrom cleantext import clean\n\nclean(\"some input\",\n fix_unicode=True, # fix various unicode errors\n to_ascii=True, # transliterate to closest ASCII representation\n lower=True, # lowercase text\n no_line_breaks=False, # fully strip line breaks as opposed to only normalizing them\n no_urls=False, # replace all URLs with a special token\n no_emails=False, # replace all email addresses with a special token\n no_phone_numbers=False, # replace all phone numbers with a special token\n no_numbers=False, # replace all numbers with a special token\n no_digits=False, # replace all digits with a special token\n no_currency_symbols=False, # replace all currency symbols with a special token\n no_punct=False, # fully remove punctuation\n replace_with_url=\"\",\n replace_with_email=\"\",\n replace_with_phone_number=\"\",\n replace_with_number=\"\",\n replace_with_digit=\"0\",\n replace_with_currency_symbol=\"\",\n lang=\"en\" # set to 'de' for German special handling\n)\n```\n\nCarefully choose the arguments that fit your task. The default parameters are listed above. Whitespace is always normalized.\n\nYou may also only use specific functions for cleaning. For this, take a look at the [source code](https://github.com/jfilter/clean-text/blob/master/cleantext/clean.py).\n\nSo far, only English and German are fully supported. It should work for the majority of Western languages. If you need some special handling for your language, feel free to contribute. 
\ud83d\ude43\n\n## Development\n\n- install [Pipenv](https://pipenv.readthedocs.io/en/latest/)\n- get the package: `git clone https://github.com/jfilter/clean-text && cd clean-text && pipenv install`\n- run tests: `pipenv run pytest`\n\n## Contributing\n\nIf you have a **question**, found a **bug** or want to propose a new **feature**, have a look at the [issues page](https://github.com/jfilter/clean-text/issues).\n\n**Pull requests** are especially welcome when they fix bugs or improve the code quality.\n\nIf you don't like the output of `clean-text`, consider adding a [test](https://github.com/jfilter/clean-text/tree/master/tests) with your specific input and desired output.\n\n## Related Work\n\n- https://github.com/pudo/normality\n- https://github.com/davidmogar/cucco\n\n## Acknowledgements\n\nBuilt upon the work by [Burton DeWilde](https://github.com/bdewilde) for [Textacy](https://github.com/chartbeat-labs/textacy).\n\n## License\n\nApache\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jfilter/clean-text", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "clean-text", "package_url": "https://pypi.org/project/clean-text/", "platform": "", "project_url": "https://pypi.org/project/clean-text/", "project_urls": { "Homepage": "https://github.com/jfilter/clean-text" }, "release_url": "https://pypi.org/project/clean-text/0.1.1/", "requires_dist": [ "ftfy", "unidecode ; extra == 'gpl'" ], "requires_python": "", "summary": "Clean Your Text to Create Normalized Text Representations", "version": "0.1.1" }, "last_serial": 5184295, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "d660eb4050eed4fa1d24f2c9edad1403", "sha256": "c90bcd27aefbaf9656c9ebcc18c60deaa01ee1dedea0f6b9474c9de4b19ed83d" }, "downloads": -1, "filename": "clean_text-0.1.1-py3-none-any.whl", "has_sig": 
false, "md5_digest": "d660eb4050eed4fa1d24f2c9edad1403", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9033, "upload_time": "2019-04-24T20:19:53", "url": "https://files.pythonhosted.org/packages/23/98/2650271bc1052002ad7e61595f7a44ff24f6bb4eb24d9c0e42e92c991708/clean_text-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "28ba5fa6a9abb0321df4455c3a816657", "sha256": "dcd547366a35c27b49897793ec6b0ef4f0dcfa772c5b1f2343ee81b7fe2378b3" }, "downloads": -1, "filename": "clean-text-0.1.1.tar.gz", "has_sig": false, "md5_digest": "28ba5fa6a9abb0321df4455c3a816657", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7514, "upload_time": "2019-04-24T20:20:01", "url": "https://files.pythonhosted.org/packages/be/49/ae6a7ee2e840017653beff7ed0548d44cac799ccf8e970727ab56f6a6095/clean-text-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d660eb4050eed4fa1d24f2c9edad1403", "sha256": "c90bcd27aefbaf9656c9ebcc18c60deaa01ee1dedea0f6b9474c9de4b19ed83d" }, "downloads": -1, "filename": "clean_text-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "d660eb4050eed4fa1d24f2c9edad1403", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9033, "upload_time": "2019-04-24T20:19:53", "url": "https://files.pythonhosted.org/packages/23/98/2650271bc1052002ad7e61595f7a44ff24f6bb4eb24d9c0e42e92c991708/clean_text-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "28ba5fa6a9abb0321df4455c3a816657", "sha256": "dcd547366a35c27b49897793ec6b0ef4f0dcfa772c5b1f2343ee81b7fe2378b3" }, "downloads": -1, "filename": "clean-text-0.1.1.tar.gz", "has_sig": false, "md5_digest": "28ba5fa6a9abb0321df4455c3a816657", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7514, "upload_time": "2019-04-24T20:20:01", "url": 
"https://files.pythonhosted.org/packages/be/49/ae6a7ee2e840017653beff7ed0548d44cac799ccf8e970727ab56f6a6095/clean-text-0.1.1.tar.gz" } ] }