{ "info": { "author": "huntzhan", "author_email": "programmer.zhx@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 1 - Planning", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Software Development :: Libraries" ], "description": "# text-cleaner, simple text preprocessing tool\n\n## Introduction\n\n* Support Python 2.7, 3.3, 3.4, 3.5.\n* Simple interfaces.\n* Easy to extend.\n\n## Install\n\n```\npip install text-cleaner\n```\n\n**WARNING FOR PYTHON 2.7 USERS**: Only UCS-4 build is supported(`--enable-unicode=ucs4`), UCS-2 build ([see this](http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to-not-get-a-bad-character-range-error)) is **NOT SUPPORTED** in the latest version.\n\n## Usage\n\n```python\nfrom text_cleaner import remove, keep\n\nfrom text_cleaner.processor.common import ASCII\nfrom text_cleaner.processor.chinese import CHINESE, CHINESE_SYMBOLS_AND_PUNCTUATION\nfrom text_cleaner.processor.misc import RESTRICT_URL\n\n# remove url and ascii characters.\n# return: u'\u70b9\u51fb \u67e5\u770b '\nremove(\n '\u70b9\u51fbhttp://t.cn/RtU0mZ1 \u67e5\u770b,123456,test',\n [RESTRICT_URL, ASCII],\n)\n\n# remove only Chinese punctuation.\n# return: u'\u70b9\u51fb http://t.cn/RtU0mZ1 \u67e5\u770b,123456,test '\nremove(\n '\u70b9\u51fb\uff1ahttp://t.cn/RtU0mZ1\uff0c \u67e5\u770b,123456,test\u3002\uff01\uff1f',\n [RESTRICT_URL, ASCII],\n)\n\n# keep chinese characters and url.\n# return: u'\u70b9\u51fb http://t.cn/RtU0mZ1 \u67e5\u770b'\nkeep(\n '\u70b9\u51fbhttp://t.cn/RtU0mZ1 \u67e5\u770b,123456,test',\n [CHINESE, RESTRICT_URL],\n)\n\n# use processor directly.\n# return: u'\u70b9\u51fb \u67e5\u770b'\nRESTRICT_URL.remove('\u70b9\u51fbhttp://t.cn/RtU0mZ1 \u67e5\u770b')\n# return: u'\u70b9\u51fb \u67e5\u770b'\nRESTRICT_URL.replace('').remove('\u70b9\u51fbhttp://t.cn/RtU0mZ1 \u67e5\u770b')\n```\n\n## Interfaces\n\n*text_cleaner.remove(text, processors)*:\n\n* *text*: `str` or `bytes` (`unicode` or `str` for Python 2).\n* *processors*: iterable of processors. *remove* invokes `remove` of each processor to handle *text*.\n\n*text_cleaner.keep(text, processors)*:\n\n* same as *remove*, but invoke `keep` method of processors instead.\n\n## Processors\n\n*DEFAULT\\_REPLACE\\_TEXT*: `' '`, single space.\n\n*RegexProcessor(regex, replace\\_text=DEFAULT\\_REPLACE\\_TEXT)*\n\n* contruct a regex processor for *regex*, replace unmatched components with *replace\\_text*.\n* *replace(self, new\\_replace\\_text)*: create a new processor, with new *replace\\_text* is set.\n* *remove(self, text)*: remove all occurences of *regex* from *text*.\n* *keep(self, text)*: keep only the occurences of *regex*, remove all unmatched components from *text*.\n* *verify(self, text)*: return *True* if text match *regex*, otherwise returns *False*.\n\n*UnicodeRange(begin, end)*:\n\n* *begin*: *int*, the begin of unicode range.\n* *end*: *int*, the end of unicode range.\n\n*UnicodeRangeProcessor(ranges, replace\\_text=DEFAULT\\_REPLACE\\_TEXT)*\n\n* subclass of *RegexProcessor*.\n* *ranges*: iterable of instances of *UnicodeRange*.\n\n## Built-in Processors\n\nFollowing processors are defined by *UnicodeRange* and regex. Read the source code if you are sure about what's going on.\n\n`text_cleaner.processor.common`, for common usage:\n\n* `ALPHA`\n* `DIGIT`\n* `SYMBOLS_AND_PUNCTUATION`\n* `ASCII`\n* `ALPHA_EXTENSION`\n* `DIGIT_EXTENSION`\n* `SYMBOLS_AND_PUNCTUATION_EXTENSION`\n* `GENERAL_PUNCTUATION`\n\n`text_cleaner.processor.misc`, misellanious processors:\n\n* `URL`\n* `RESTRICT_URL`\n* `ESCAPED_WHITESPACE`\n* `WECHAT_EMOJI_EN`\n* `WECHAT_EMOJI_ZHCN`\n* `WECHAT_EMOJI`\n\n`text_cleaner.processor.chinese`, Chinese processing:\n\n* `CHINESE_CHARACTER`: only common characters.\n* `CHINESE`: common characters + symbols and puntuations.\n* `CHINESE_ALL`: all CJK characters.\n* `CHINESE_EXTENSION`\n* `CHINESE_COMPATIBILITY`\n* `CHINESE_SYMBOLS_AND_PUNCTUATION`\n\n### URL vs. RESTRICT_URL\n\nHow to define URLs is a complex problem.\nWe provide two choices for our users.\n\n* `URL`: truncate urls till whitespaces.\n* `RESTRICT_URL`: truncate urls till non-whitespace ASCII ([!-~] in the ASCII table)\n\nFor Chinese users, we recommend using `RESTRICT_URL`.\n\n```python\nfrom text_cleaner.processor.misc import RESTRICT_URL, URL\n\nURL.remove('\u70b9\u51fbhttp://t.cn/RtU0mZ1 \u67e5\u770b')\n# '\u70b9\u51fb \u67e5\u770b'\n\nURL.remove('\u70b9\u51fbhttp://t.cn/RtU0mZ1\u67e5\u770b')\n# '\u70b9\u51fb '\n\nRESTRICT_URL.remove('\u70b9\u51fbhttp://t.cn/RtU0mZ1\u67e5\u770b')\n# '\u70b9\u51fb \u67e5\u770b'\n```\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "text_cleaner", "package_url": "https://pypi.org/project/text_cleaner/", "platform": "", "project_url": "https://pypi.org/project/text_cleaner/", "project_urls": null, "release_url": "https://pypi.org/project/text_cleaner/0.2.6/", "requires_dist": null, "requires_python": "", "summary": "purify text for NLP.", "version": "0.2.6" }, "last_serial": 2734175, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "cc0084922e8a4881341114ac26bd594e", "sha256": "4805c12012917f86484c00153cd9d300787dbf13e0bac172466dce9bbde98187" }, "downloads": -1, "filename": "text_cleaner-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "cc0084922e8a4881341114ac26bd594e", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 8616, "upload_time": "2016-07-24T08:14:12", "url": "https://files.pythonhosted.org/packages/fa/ea/28dfed7ec884fc3c2f93c07be8193a656669494ca6938fb921fc81d5ce00/text_cleaner-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "120acb94dfa6c6d4f137da07284d2154", "sha256": "f7ea120c89b03816cae2bb1852abb0dc4a17a0e099fd4cad3aaadd5d6e5c7b5d" }, "downloads": -1, "filename": "text_cleaner-0.1.0.tar.gz", "has_sig": false, "md5_digest": "120acb94dfa6c6d4f137da07284d2154", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8117, "upload_time": "2016-07-24T08:14:07", "url": "https://files.pythonhosted.org/packages/4f/39/3cd5e04eafd8a8eeac1be292be788055c06b803c732e5a5bd148b0ac7d8d/text_cleaner-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "45f8958d7b8036b0e4153cc7f5538884", "sha256": "e5b307260b78396a21a69d88bf185008d52cc337c0212d236f3bbbcfa7cdad05" }, "downloads": -1, "filename": "text_cleaner-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "45f8958d7b8036b0e4153cc7f5538884", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 9406, "upload_time": "2016-07-26T05:34:04", "url": "https://files.pythonhosted.org/packages/56/cf/61c4b16618f58f623c0b9f55f652c96d2b2e5cc98c809e36d3a9412a47eb/text_cleaner-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1b02c0921fee5e442920a3eb9573e7b6", "sha256": "6376fbe179c1094bc2883d2ffd42ade6537b64b09ab2463aad912e13bd29da33" }, "downloads": -1, "filename": "text_cleaner-0.1.1.tar.gz", "has_sig": false, "md5_digest": "1b02c0921fee5e442920a3eb9573e7b6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8962, "upload_time": "2016-07-26T05:33:59", "url": "https://files.pythonhosted.org/packages/40/c6/03580c5d82654b3bc203de6ead97d6213fcd2028edf990e60af5d638a21b/text_cleaner-0.1.1.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "d9b0ae35acf754ed5f6d97adea6972bc", "sha256": "25dbb9eaa63309c8f7f19bd9d20a9c03cadb1665a1ec0f6da816d6f9a24ef6d0" }, "downloads": -1, "filename": "text_cleaner-0.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d9b0ae35acf754ed5f6d97adea6972bc", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 10754, "upload_time": "2016-07-27T08:07:49", "url": "https://files.pythonhosted.org/packages/47/c8/56582cbd2cbda20c8060604beefb1a05e72869f43bb0a534cd881f406050/text_cleaner-0.2.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4c5a49101489d59ae53b835c751100d9", "sha256": "9e5c9288c8da9911dbb3c7cbb6691a56dcd632e2e194221ab7187758271f206e" }, "downloads": -1, "filename": "text_cleaner-0.2.0.tar.gz", "has_sig": false, "md5_digest": "4c5a49101489d59ae53b835c751100d9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10621, "upload_time": "2016-07-27T08:07:45", "url": "https://files.pythonhosted.org/packages/70/7e/95551b4e577a7c587ecb47b467f2813e3e7ba2acb36d1f825d8f6a02dce4/text_cleaner-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "ca9dc7c965e7cf163fd363137dd34230", "sha256": "d64a3cc5c032d8ef0f87806fdcd9e70eacf85da835ac0a036b35640a15991002" }, "downloads": -1, "filename": "text_cleaner-0.2.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ca9dc7c965e7cf163fd363137dd34230", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 10733, "upload_time": "2016-07-28T07:05:19", "url": "https://files.pythonhosted.org/packages/02/ac/301e5425d4fe233eb78182996a5ebc4cb850e48d275b28cb2dedb90169d3/text_cleaner-0.2.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "73aa16ba18fc357160423f73df61938d", "sha256": "aa826ca8666c983f79ba326749eac83216a4599cdd157b16c55e6e9449239ff6" }, "downloads": -1, "filename": "text_cleaner-0.2.1.tar.gz", "has_sig": false, "md5_digest": "73aa16ba18fc357160423f73df61938d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10744, "upload_time": "2016-07-28T07:05:15", "url": "https://files.pythonhosted.org/packages/fd/ee/9b312d3457957c60c134f1311e3f79e991d3c452261f2fea68a6774c6724/text_cleaner-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "c3c0835d231c4be9c7a018c37eff38f3", "sha256": "74cf8ce8de61ac3f6dfe9597847543aa7bc2ee19a061e96b490f3061483e0e59" }, "downloads": -1, "filename": "text_cleaner-0.2.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c3c0835d231c4be9c7a018c37eff38f3", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 10833, "upload_time": "2016-08-11T03:26:15", "url": "https://files.pythonhosted.org/packages/17/4f/6f0e12e3694cb8faf0d21db0d19e265c744ef02b47c4e79dbe3527e1643d/text_cleaner-0.2.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0f799ae51b640bcac4031e1f77e56215", "sha256": "e00926f3d3446bf615667df5a2dc4884ea85e5162bcd0bd7a6e104a6ca051e4e" }, "downloads": -1, "filename": "text_cleaner-0.2.2.tar.gz", "has_sig": false, "md5_digest": "0f799ae51b640bcac4031e1f77e56215", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13754, "upload_time": "2016-08-11T03:26:12", "url": "https://files.pythonhosted.org/packages/a1/81/48965d92e73db49cc046f0a6ebe8ffd7f7b376993bfd2a7bfc0b05be03b9/text_cleaner-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "8973ff365cb85c1f35111c93c54dfcba", "sha256": "99d79444d9b7dc67c83030d5e9d4baf33041125c6e88584ab2a02662033e152c" }, "downloads": -1, "filename": "text_cleaner-0.2.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "8973ff365cb85c1f35111c93c54dfcba", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 10926, "upload_time": "2016-08-19T07:14:20", "url": "https://files.pythonhosted.org/packages/20/cd/41dec7f4c8a41fbfcd7b7854d6ce9aab9b4f0c775c452e7b663f87f9b213/text_cleaner-0.2.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6df51ee444b65a4ab773db9750a8ec33", "sha256": "d1ac572bdc06ada786f0e412f8304792ad26dcea270296ccb7caddf92c5e21d2" }, "downloads": -1, "filename": "text_cleaner-0.2.3.tar.gz", "has_sig": false, "md5_digest": "6df51ee444b65a4ab773db9750a8ec33", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13873, "upload_time": "2016-08-19T07:14:15", "url": "https://files.pythonhosted.org/packages/ba/01/c312fb8c63938e39336975ed09033e51171f7e366a8ef75314fb9a16bb0f/text_cleaner-0.2.3.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "9e1880ef086dce44d569bc259a6b874c", "sha256": "71ab5ea421a3585ef17e909affbb717af3ebfa7cf1b365031167b6118d1c40c1" }, "downloads": -1, "filename": "text_cleaner-0.2.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "9e1880ef086dce44d569bc259a6b874c", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 10947, "upload_time": "2016-08-22T11:58:35", "url": "https://files.pythonhosted.org/packages/b2/a9/84c2e030a3eea41800e0be10a646984a28e2fd9a8c7d94c16d4cbcc584c5/text_cleaner-0.2.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "af9a490f70d8b761d0ce7f6e707286c1", "sha256": "d651d2a00d1c0da201efc494510d7362df235ab24b5e587e2dde7dc0ba3da81e" }, "downloads": -1, "filename": "text_cleaner-0.2.4.tar.gz", "has_sig": false, "md5_digest": "af9a490f70d8b761d0ce7f6e707286c1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13895, "upload_time": "2016-08-22T11:58:31", "url": "https://files.pythonhosted.org/packages/98/65/41047bbaffc1890b3849ad5016d4f7be2e939109d1e9a1331d63e613e528/text_cleaner-0.2.4.tar.gz" } ], "0.2.5": [ { "comment_text": "", "digests": { "md5": "81c555d061031c4dc5ea335e66eca012", "sha256": "3862b9d929f1276b1567ce7a36868abc85875584085cb4055ee82c2837c3cfb0" }, "downloads": -1, "filename": "text_cleaner-0.2.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "81c555d061031c4dc5ea335e66eca012", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 10993, "upload_time": "2016-10-03T15:58:39", "url": "https://files.pythonhosted.org/packages/16/82/28f1b85133c90412c116b8b21a53bd00137afb955d06a2b7efa75ea91866/text_cleaner-0.2.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "064781f0c74310a34da3800c74555d9a", "sha256": "343fac1ce0bf6ae15725bac3f43777bcbf4c299313f7c8a44619d81591caa110" }, "downloads": -1, "filename": "text_cleaner-0.2.5.tar.gz", "has_sig": false, "md5_digest": "064781f0c74310a34da3800c74555d9a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13950, "upload_time": "2016-10-03T15:58:35", "url": "https://files.pythonhosted.org/packages/86/2b/c7f4c1b9199237aaf8f62b98ccb8a916bf17014ad9eba8c99d640394af94/text_cleaner-0.2.5.tar.gz" } ], "0.2.6": [ { "comment_text": "", "digests": { "md5": "24f743de4d00aadbac541d437197f3e7", "sha256": "1edbf15f212593824f8f9ddd651b82f7735505576142fbc60b426b5b11de197c" }, "downloads": -1, "filename": "text_cleaner-0.2.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "24f743de4d00aadbac541d437197f3e7", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 11511, "upload_time": "2017-03-27T18:12:55", "url": "https://files.pythonhosted.org/packages/b0/8b/446922cde7c811dbb8fb71ffe75becd28c9fdb83e62438f6f64d1372e17d/text_cleaner-0.2.6-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a8c5d2caf1f2117fa43ff45de5b1f1f8", "sha256": "149117a31b5c03e224f956ad274648b9c8d37bf7c05234682728aaa4d6dc6367" }, "downloads": -1, "filename": "text_cleaner-0.2.6.tar.gz", "has_sig": false, "md5_digest": "a8c5d2caf1f2117fa43ff45de5b1f1f8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14535, "upload_time": "2017-03-27T18:12:48", "url": "https://files.pythonhosted.org/packages/ab/d4/3fb4c54528a6a44ddc41b9a582af03ef2caef111d977b491d455b9cec1a7/text_cleaner-0.2.6.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "24f743de4d00aadbac541d437197f3e7", "sha256": "1edbf15f212593824f8f9ddd651b82f7735505576142fbc60b426b5b11de197c" }, "downloads": -1, "filename": "text_cleaner-0.2.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "24f743de4d00aadbac541d437197f3e7", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 11511, "upload_time": "2017-03-27T18:12:55", "url": "https://files.pythonhosted.org/packages/b0/8b/446922cde7c811dbb8fb71ffe75becd28c9fdb83e62438f6f64d1372e17d/text_cleaner-0.2.6-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a8c5d2caf1f2117fa43ff45de5b1f1f8", "sha256": "149117a31b5c03e224f956ad274648b9c8d37bf7c05234682728aaa4d6dc6367" }, "downloads": -1, "filename": "text_cleaner-0.2.6.tar.gz", "has_sig": false, "md5_digest": "a8c5d2caf1f2117fa43ff45de5b1f1f8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14535, "upload_time": "2017-03-27T18:12:48", "url": "https://files.pythonhosted.org/packages/ab/d4/3fb4c54528a6a44ddc41b9a582af03ef2caef111d977b491d455b9cec1a7/text_cleaner-0.2.6.tar.gz" } ] }