{ "info": { "author": "keyi.tang", "author_email": "keyit92@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Text Processor\n## Intro\nThis python package provides a easy-use interface to process human language text with extensive NLP resource, such as corporal, stemmers, tokenizers and language embeddings. The major goal is to ease the effort of integrating different nlp python packages.\n\nAll the text-processing moduels in this package are build on top of the NLPLibrary, which is the resource-management module that summarises all the known NLP resource, and provides a consistent interface to load queried NLP resource on the fly. \n\n\n\n### Initialize TextPreprocess and NLPLibrary\n```python\nIn:\n tp = TextPreprocess(NLPLibrary())\n```\n### Load the resource contents required for the target task (optional).\n```python\nIn:\n tp.load_content_from_library(\"resource_class_name\", \"resource_library_name\")\n\n tp.load_content_from_library(\"sentence_tokenizer\", \"nltk_eng_punkt\")\n```\n*Resource Class Name* is a required argument, which indicates the what type of resource you need. It can be stopword, punctuation or stemmer etc.\n\nEach type of resource can supported multiple libraries. *Resource Library Name* is indicates which library you want to use. As an example, you can use either *Porter ('resource_library_name': 'nltk_eng_porter')* or *Lancaster ('resource_library_name': 'nltk_eng_lancaster')* to stem (*'resource_class_name': 'stemmer'*) your text. \n*Resource Library Name* has default value for every class of resource. \n\nCalling this function for a loaded resource class will change the resource library. \n\n### Call NLP Functions\n```python\nIn:\n sentences_list = tp.tokenize_to_sentences(text)\n```\n*TextPreprocess* tends to encapsulate and organize all the NLP methods that are needed for preprocessing documents before ML phase. The calling function will use the resource loaded in its *NLPLibrary* to perform the task it responsible to. \n\nIf the required resource has not been loaded before calling, it will load the default resource autonomously to support its task.\n\n### Show Resource Catalog\n**TextPreprocess** and **NLPLibrary** provide interface to print out the resource list:\n```python\nIn:\n tp.show_library_catalog()\n```\nThis function can help find the available resources.\n\n### Show Loaded Items\n```python\nIn:\n tp.show_library_items()\n```\nThis function can tell the information about the loaded resources. \n\n## Scikit-Learn Moduels\nNLPUnit is the core interface which provides a scikit-learn wrapper for every NLP moduel, inlucding *Tokenizer*, *Normalizer*, *Filter*, *Encoder* etc. \n\nThree composite text-preprocessing modules are implemented:\n\n* [**Document2WordPage**](#doc2wp): transform a raw document to [Word Page](#dataflow-wordpage).\n* [**Documents2WordPages**](#docs2wps): transform a list of raw documents to a list of [Word Pages](#dataflow-wordpage).\n* [**Documents2BOW**](#docs2bow): transform a list of raw documents to [BOW](#dataflow-bow).\n\nThey are introduced in [Data Flow](#dataflow) section. \n\n**To Figure How to Work With Them, Please See Unit Test Files.**\n\n## Data Flow\n\n### Data Model Convention\n\nTo offer a consistent and intuitive interface, this package follows a name convention for text data. \n\n#### Documents(s)\n\n* Type: String or List of Strings\n* Data: An untokenized raw text or list of untokenized raw texts.\n\n#### Sentences\n\n* Type: List of Strings.\n* Data: List of sentences. Output of sentence tokenizer. Sentences are ordinal, i.e the sequential order of sentences is kept.\n\n#### Words\n\n* Type: List of String or List of String Tuple.\n* Data: List of word token or List of tagged word tokens. \t* For tagged words, each element is a tuple. The first member of the tuple is the word, and the second is its corresponding tag. \n\n#### Word Page\n\n* Type: List of String List or List of String Tuple List.\n* Data: List of [Words](#dataflow-words). Output of word tokenizer if the input is [Sentences](#dataflow-sentences). \n\t* Word Page is ordinal, i.e the sequential order of words is kept. \n\t* For tagged word page, each element is a tuple. The first member of the tuple is the word, and the second is its corresponding tag. \n\n#### Bags Of Words (BOW)\n* Type: List of String List or List of String Tuple List.\n* Data: List of [Words](#dataflow-words). \n\t* Words in BOW are not neccessarily ordinal, i.e no sequential order between words.\n\t* Each string list is a collection of representative words of a document. \n\n### Composite Text Preprocessing Blocks\n\n\n#### Document2WordPage\n\n**Document2WordPage** transforms a raw [Document](#dataflow-document) to a [Word Page](#dataflow-wordpage):\n\n\n\nAs all the blocks after WordTokenizer take [Word Page](#dataflow-wordpage) as input and output, they (CaseLower, POSTagger, Lemmatizer, TagCleaner) can be switched off to skip certain operations on the text data.\n\n#### Documents2WordPages\n\n**Documents2WordPages** transforms a list of raw [Documents](#dataflow-document) to a list of word pages. **Documents2WordPages** uses [Document2WordPage](#dataflow-doc2wp) block to map each input document to corresponding word page, and output them as a list in the same as the input.\n\n\n\nThe type of DocumentsWordPages output is a 3-neste-layer list of string.\n\n#### Documents2BOW\n\n**Documents2BOW** transforms a list of raw [Documents](#dataflow-document) to [BOW](#dataflow-bow).\n\n\n\n**TokenTensorReducer** merges lists in lower level of a given nested list. It transforms a List of [Word Page](#dataflow-wordpage) to BOW. Multiple filter blocks used in **Documents2BOW**, including POSFilter, Stopwords Filter and Punctuation Filter. *Filter Block* and *TagCleaner Block* can be switched off.\n\n## Showcase Example of TextPreprocess Interface\n* Given a simple text.\n\n```python\nIn:\n text = \"She likes dogs. Food is awesome! We went to library.\"\n```\n\n* Initialize a TextPreprocess instance by passing a NLPLibrary instance to its initializer. \n\n```python\nIn:\n tp = TextPreprocess(NLPLibrary())\n```\n\n* Tokenize the text into sentences and word sequences.\n\n```python\nIn:\n sentences_list = tp.tokenize_to_sentences(text)\n documents = tp.tokenize_sents_to_words(sentences_list)\n print(documents)\n```\n\n```python\nOut:\n [['She', 'likes', 'dogs', '.'], ['Food', 'is', 'awesome', '!'], ['We', 'went', 'to', 'library', '.']]\n```\n\n* Part-of-Speech (POS) Tagging tokens and Normalizing the text.\n\n```python\nIn:\n tagged_documents = tp.pos_tag(documents)\n normalized_documents = tp.lemmatize_documents(tagged_documents)\n print(normalized_documents)\n```\n\n```python\nOut:\n [['She', 'like', 'dog', '.'], ['Food', 'be', 'awesome', '!'], ['We', 'go', 'to', 'library', '.']]\n```\n\n* Keeping Verbs Only and Removing Other Words.\n\n```python\nIn:\n verbs_in_sents = tp.focus_on_pos_tag_type(documents, ['verb'])\n print(verbs_in_sents)\n```\n\n```python\nOut:\n [['like'], ['be'], ['go']]\n```\n\n## TODO List\n* Model Evaluation Modules\n* Spelling Checking Modules\n* Sentence Structure Filtering Modules\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/KeyiT/txplib", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "txplib", "package_url": "https://pypi.org/project/txplib/", "platform": "", "project_url": "https://pypi.org/project/txplib/", "project_urls": { "Homepage": "https://github.com/KeyiT/txplib" }, "release_url": "https://pypi.org/project/txplib/0.1/", "requires_dist": [ "numpy (==1.14.5)", "nltk (==3.3)", "scipy (==1.1.0)" ], "requires_python": "", "summary": "text preprocessing utils.", "version": "0.1" }, "last_serial": 4744394, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "7b0f71072b7720523c49180121fb8cec", "sha256": "c0947df42cc3d1a2fd5e5d2137f540039a7b1faa94f10a7f74fc8d43f4b9f96b" }, "downloads": -1, "filename": "txplib-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "7b0f71072b7720523c49180121fb8cec", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7184, "upload_time": "2019-01-26T19:57:33", "url": "https://files.pythonhosted.org/packages/43/e0/a700dc5a10a26b6ecb5fb7a845f86eff2c448896d77228d40043ae2eaffe/txplib-0.1-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7b0f71072b7720523c49180121fb8cec", "sha256": "c0947df42cc3d1a2fd5e5d2137f540039a7b1faa94f10a7f74fc8d43f4b9f96b" }, "downloads": -1, "filename": "txplib-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "7b0f71072b7720523c49180121fb8cec", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7184, "upload_time": "2019-01-26T19:57:33", "url": "https://files.pythonhosted.org/packages/43/e0/a700dc5a10a26b6ecb5fb7a845f86eff2c448896d77228d40043ae2eaffe/txplib-0.1-py3-none-any.whl" } ] }