{
"info": {
"author": "EN, PLLiao",
"author_email": "",
"bugtrack_url": null,
"classifiers": [
"Programming Language :: Python",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6"
],
"description": "# purewords\n\nPurewords is a package used to clean raw texts for all languages. \n\n## Install\n\n `pip install purewords`\n\n\n## Usage\n\n ### Module usage:\n\n ```python\n import purewords\n\n # raw sentence\n inputs = \"ha hi!!hello I\\'m at http:www.google.com.tw\\n\\n\" \n + \"you know yahoo? my_computer is great. My phone number\"\n + \"is 02-3366-5678.
\u7684\u5566
my password: 123-abc$99&^%Y)\\'_\\'(Y \"\n ```\n #### Treat inputs as a sentence and clean it. \n\n Word tokens are splitted with whitespace\n ```python\n # result: string\n purewords.clean_sentence(inputs)\n 'ha hi hello i am at _url_ you know yahoo my computer is great my phone number is _phone_ \u7684 my password _num_ abc _num_ y y'\n ```\n\n #### Treat inputs as a document and clean it.\n\n Split document with some confident splitting token such as '.' or '?'.\n ```python\n # result: list of cleaned string\n purewords.clean_document(inputs)\n ['ha hi', 'hello i am at _url_', 'you know yahoo', 'my computer is great', 'my phone number is _phone_', '\u7684 my password _num_ abc _num_ y y']\n ```\n\n ### Customed your purewords\n\n You can use different setting in purewords.\n\n ```python\n import purewords\n from purewords.tokenizer import YoctolTokenizer\n from purewords.filter_collection import document_filters\n from purewords.filter_collection import token_filters\n\n tokenizer = YoctolTokenizer()\n pw = purewords.PureWords(\n tokenizer=tokenizer, # select your tokenizer\n document_filters=document_filters, # select your document filters\n token_filters=token_filters, # select your token filters\n max_len=200, # cut long sentence whose length exceed max_len\n min_len=1 # ignore short sentence \n )\n\n inputs = 'This is a sentence.'\n\n pw.clean_sentence(inputs)\n pw.clean_document(inputs)\n ```\n\n #### Tokenizer\n\n ##### Select your tokenizer in purewords\n\n You can select `WhitespaceTokenizer` tokenizer if you prefer tokenize \n sentences with whitespace or `JiebaTokenizer` for default jieba setting.\n\n Otherwise, we use yoctol jeiba tokenizer as our default setting.\n\n ```python\n from purewords.tokenizer import WhitespaceTokenizer\n\n tokenizer = WhitespaceTokenizer()\n pw = purewords.PureWords(\n tokenizer=tokenizer\n )\n ```\n\n ##### Add new words in JiebaTokenizer\n\n You can add new word in JiebaTokenizer to customize your tokenizer.\n\n ```python\n from purewords.tokenizer import JiebaTokenizer\n\n tokenizer = JiebaTokenizer()\n tokenizer.add_word(new_word, freq, tag) # The setting is same with jieba.add_word\n tokenizer.add_words(new_word_list, freq, tag)\n\n pw = purewords.PureWords(\n tokenizer=tokenizer\n )\n ```\n\n #### Filter collection\n\n You can customize your preprocesing ways in purewords.\n\n * document_filters: preprocess the raw sentence before sentence splitting\n * token_filters: preprocess tokens after tokenization of each sentence\n\n ##### Organize your filters\n\n You can create your customized filters by adding your filters in our filter collection class.\n\n Filter means a callable object which receives a raw sentence and returns the processed one.\n\n The preprocessing order is consistent with the adding order of filters.\n\n ```python\n from purewords.filter_collection import BaseFilterCollection\n\n custom_filters = BaseFilterCollection()\n custom_filters.add(filter_1)\n custom_filters.add(filter_2)\n ...\n custom_filters.add(filter_n)\n\n pw = purewords.PureWords(\n tokenizer=tokenizer,\n document_filters=custom_filters,\n )\n ```\n\n #### Stopwords\n\n You can add stopwords in `purewords/config/stopwords.txt`.\n\n\n ### Command line usage:\n\n Preprocess text files into a single cleaned document from command line.\n\n Usage:\n\n #### Clean single txt files\n ```\n python -m purewords input_file_path\n ```\n\n #### Clean text files in a directory\n\n Or, you can use following command to clean all the txt files in your directory.\n ```\n python -m purewords -d your_raw_text_dir\n ```\n\n #### Ignore short sentences \n\n If you prefer long sentences and want to ignore short sentences less than 5 words, you can try this.\n ```\n python -m purewords -min 5 your_text_file\n ```\n\n #### Cut long sentences\n\n Or you prefer short sentences less than 30 words and want to cut long sentences into short sentences.\n\n You can set up the maximun sentence length like this.\n ```\n python -m purewords -max 30 your_text_file\n ```\n\n #### Use multi-thread to speed up\n\n You can also use multi-trhead to speed up the cleaning process. \n\n In the follwoing example, you clean all the text files with 4 threads\n ```\n python -m purewords -j 4 -d your_raw_text_dir\n ```\n\n\n",
"description_content_type": "",
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "",
"keywords": "",
"license": "MIT",
"maintainer": "",
"maintainer_email": "",
"name": "purewords",
"package_url": "https://pypi.org/project/purewords/",
"platform": "",
"project_url": "https://pypi.org/project/purewords/",
"project_urls": null,
"release_url": "https://pypi.org/project/purewords/0.1.1/",
"requires_dist": [
"jieba (==0.39)",
"joblib (==0.12.2)"
],
"requires_python": ">=3.5",
"summary": "A NLP preprocessing package",
"version": "0.1.1"
},
"last_serial": 4142962,
"releases": {
"0.1.0": [
{
"comment_text": "",
"digests": {
"md5": "35796fa99103d7f0baaee9303e10d5d2",
"sha256": "9bc61a1805d982a28aa2f029c66392e497f5abc63a2060755083535bed344b71"
},
"downloads": -1,
"filename": "purewords-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "35796fa99103d7f0baaee9303e10d5d2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 3055546,
"upload_time": "2018-08-07T04:01:21",
"url": "https://files.pythonhosted.org/packages/a5/74/35d6d06ba9fd3ac91ccbfc6f5fe52eb6a02a127f163241828ceeb1c3218e/purewords-0.1.0-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "24ffc9899e2e6c379c97f3a6dfc7e95b",
"sha256": "85ed154b416001af359cee02ea338b52eeeb87e94a07b6950f5e0fca0656018b"
},
"downloads": -1,
"filename": "purewords-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "24ffc9899e2e6c379c97f3a6dfc7e95b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 3047097,
"upload_time": "2018-08-07T04:01:23",
"url": "https://files.pythonhosted.org/packages/e2/d4/b9a11cee36cfd49d7a0578c0838bf47df6f0ddffe0f17d88f1fa7a233a5e/purewords-0.1.0.tar.gz"
}
],
"0.1.1": [
{
"comment_text": "",
"digests": {
"md5": "229541ccd6992144623c5f179cfc1d4c",
"sha256": "49309c72076a9bb1629c12f3e844c6e4430579798342d0bbb5f5e7367bf00961"
},
"downloads": -1,
"filename": "purewords-0.1.1-py3.5.egg",
"has_sig": false,
"md5_digest": "229541ccd6992144623c5f179cfc1d4c",
"packagetype": "bdist_egg",
"python_version": "3.5",
"requires_python": ">=3.5",
"size": 3084793,
"upload_time": "2018-08-07T06:22:26",
"url": "https://files.pythonhosted.org/packages/da/79/506d4fd67b237e2bcd2d2e73a9d74ac6d763529b77bf4232e8ac9a7fb74f/purewords-0.1.1-py3.5.egg"
},
{
"comment_text": "",
"digests": {
"md5": "0f1c5390ff8c97c9a8cdb906e876e059",
"sha256": "ff7f32cc88aa33e3304ed043c33864101d4f98b5ac5a8fa3607fbd6b42711d8c"
},
"downloads": -1,
"filename": "purewords-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0f1c5390ff8c97c9a8cdb906e876e059",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 3055571,
"upload_time": "2018-08-07T06:22:21",
"url": "https://files.pythonhosted.org/packages/69/69/422ccfd15fd1821d50986105eafd953b7e726ef6ae695d922f9a597fa8ce/purewords-0.1.1-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "027ce88380820f5a5f7df24af413dbde",
"sha256": "b0c19ed7d744c3a2bfce6bd8f1350e562c83bb6c740705490c9cac7c5271a864"
},
"downloads": -1,
"filename": "purewords-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "027ce88380820f5a5f7df24af413dbde",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 3047055,
"upload_time": "2018-08-07T06:22:30",
"url": "https://files.pythonhosted.org/packages/1b/56/2cd7b56ff78ffd275989d830441b47813dc926f5947ec05ef8f15f691cfa/purewords-0.1.1.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "229541ccd6992144623c5f179cfc1d4c",
"sha256": "49309c72076a9bb1629c12f3e844c6e4430579798342d0bbb5f5e7367bf00961"
},
"downloads": -1,
"filename": "purewords-0.1.1-py3.5.egg",
"has_sig": false,
"md5_digest": "229541ccd6992144623c5f179cfc1d4c",
"packagetype": "bdist_egg",
"python_version": "3.5",
"requires_python": ">=3.5",
"size": 3084793,
"upload_time": "2018-08-07T06:22:26",
"url": "https://files.pythonhosted.org/packages/da/79/506d4fd67b237e2bcd2d2e73a9d74ac6d763529b77bf4232e8ac9a7fb74f/purewords-0.1.1-py3.5.egg"
},
{
"comment_text": "",
"digests": {
"md5": "0f1c5390ff8c97c9a8cdb906e876e059",
"sha256": "ff7f32cc88aa33e3304ed043c33864101d4f98b5ac5a8fa3607fbd6b42711d8c"
},
"downloads": -1,
"filename": "purewords-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0f1c5390ff8c97c9a8cdb906e876e059",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 3055571,
"upload_time": "2018-08-07T06:22:21",
"url": "https://files.pythonhosted.org/packages/69/69/422ccfd15fd1821d50986105eafd953b7e726ef6ae695d922f9a597fa8ce/purewords-0.1.1-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "027ce88380820f5a5f7df24af413dbde",
"sha256": "b0c19ed7d744c3a2bfce6bd8f1350e562c83bb6c740705490c9cac7c5271a864"
},
"downloads": -1,
"filename": "purewords-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "027ce88380820f5a5f7df24af413dbde",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 3047055,
"upload_time": "2018-08-07T06:22:30",
"url": "https://files.pythonhosted.org/packages/1b/56/2cd7b56ff78ffd275989d830441b47813dc926f5947ec05ef8f15f691cfa/purewords-0.1.1.tar.gz"
}
]
}