{ "info": { "author": "Evan Lalopoulos", "author_email": "evan.lalopoulos.2017@my.bristol.ac.uk", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "[![Build Status](https://travis-ci.org/evanll/nlpkit-ml.svg?branch=master)](https://travis-ci.org/evanll/nlpkit-ml)\n\n# NLPkit - Transformers for text classification\nA library of scikit compatible text transformers, that are ready to be integrated in an NLP pipeline for various classification tasks. \n\n## Project structure\n .\n \u251c\u2500\u2500 nlpkit\n \u2502\u00a0\u00a0 \u251c\u2500\u2500 __init__.py\n \u2502\u00a0\u00a0 \u2514\u2500\u2500 nlp_feature_extraction\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 __init__.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 liwc\n \u2502 \u251c\u2500\u2500 word_embeddings_features.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 syntax_features.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 ner_features.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 pos_features.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 liwc_features.py\n \u2502 \u251c\u2500\u2500 text_statistics_features.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u2514\u2500\u2500 tests\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 __init__.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 test_data\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 test_liwc_feature_extraction.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 test_ner_feature_extraction.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 test_pos_feature_extraction.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u00a0\u00a0 \u251c\u2500\u2500 test_syntax_feature_extraction.py\n \u2502\u00a0\u00a0 \u00a0\u00a0 \u00a0\u00a0 \u2514\u2500\u2500 test_word_embeddings_feature_extraction.py\n \u251c\u2500\u2500 examples\n \u251c\u2500\u2500 README.md\n \u251c\u2500\u2500 LICENCE\n \u251c\u2500\u2500 requirements.txt\n \u2514\u2500\u2500 setup.py\n\n## Getting Started\n\nThese instructions will get you a copy of the project up and running on your local machine.\n\n### Prerequisites\n\n1. Python 3.6\n2. Stanford CoreNLP Server (for some transformers)\n3. Pre-trained word vectors for the w2v transformer\n\n### Stanford CoreNLP Server with Docker\nStanford CoreNLP is required for constituency parsing, POS and NER tagging.\n\nThe easiest way is to have a CoreNLP Server running is to use Docker. You can find a Dockerfile and instructions to have the server running at [Stanford CoreNLP Server - Docker](https://github.com/evanlal/stanford-corenlp-docker).\n\n### Word vectors\nIf you don't want to train your own word embeddings, you can download pre-trained word vectors from the [Stanford GloVe project](http://nlp.stanford.edu/projects/glove). \nFor example, the Wikipedia model has a vocabulary of 400K words, represented using 300 dimensional vectors. \nThe word vectors come in the GloVe format and need to be converted into the word2vec format. \nWhile the formats are almost identical, you can use gensim to do the conversion.\n\n```\npython -m gensim.scripts.glove2word2vec -i glove.txt -o word2vec.txt\n```\n\n## List of transformers\n- POSTagPreprocessor: Pre-processes text documents by tagging each word in the form of word_TAG_ e.g. what_WP. Can be used to generate POS tagged n-grams\n- NERPreprocessor: Pre-processes text documents by replacing named entities with generic tags e.g. PERSON, LOCATION\n- WordEmbedsDocVectorizer: Converts text documents to word2vec based document vector representations. It maps \nthe words of a document to word2vec vectors, and averages them across dimensions to produce a document vector\nrepresentation\n- POSExtractor: Extracts Parts of Speech (POS) counts for a collection of text documents\n- CFGExtractor: Extracts the Context Free Grammar (CFG) production rules found in a collection of text documents\n- NamedEntitiesCounter: Extracts Named Entity counts per entity type (e.g. PERSON) for a collection of text documents\n- LIWCExtractor: Extracts proportions of words that fall in the various LIWC categories for a collection of text documents\n- TextStatsExtractor: Calculates various text statistics and readability scores for a collection of text documents\n\n## Usage\nAll the custom transformers extend the BaseEstimator and TransformerMixin and implement the fit and transform methods.\n```python\n# POSExtractor\nsf_parser = CoreNLPParser(url='http://localhost:9000/', tagtype='pos')\npos_extractor = POSExtractor(sf_parser)\nX = pos_extractor.fit_transform(corpus)\n```\n\n\nThey can also be used in pipelines e.g.\n\n``` python\n\npipeline = Pipeline([\n ('pre', TextPreprocessor(stemming=False)),\n ('w2v', WordEmbedsDocVectorizer(self._word2vec, tfidf_weights=True)),\n ('clf', SVC(kernel='linear', C=1, probability=True))\n ])\n```\n\nFor more, you can run the examples included in the examples folder.\n\n## Tests\nThe Pytest framework is used for unit testing. All of the custom text transformers produced in this project come with an extensive set of unit tests.\nTo run the tests use: \n`pytest src`\n\n## Project repository\nhttps://github.com/evanll/nlpkit-ml\n\n## Author\nWritten by Evan Lalopoulos as part of his thesis in [Fake News detection using NLP](https://github.com/evanll/fake-news-nlp).\n\n**Evan Lalopoulos** - [evanll](https://github.com/evanll)\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/evanll/nlpkit-ml", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "nlpkit-ml", "package_url": "https://pypi.org/project/nlpkit-ml/", "platform": "", "project_url": "https://pypi.org/project/nlpkit-ml/", "project_urls": { "Homepage": "https://github.com/evanll/nlpkit-ml" }, "release_url": "https://pypi.org/project/nlpkit-ml/1.0.0/", "requires_dist": [ "Pyphen (==0.9.5)", "atomicwrites (==1.3.0)", "attrs (==19.1.0)", "boto3 (==1.9.220)", "boto (==2.49.0)", "botocore (==1.12.220)", "certifi (==2019.6.16)", "chardet (==3.0.4)", "docutils (==0.15.2)", "gensim (==3.4.0)", "idna (==2.8)", "importlib-metadata (==0.19)", "jmespath (==0.9.4)", "more-itertools (==7.2.0)", "nltk (==3.4.5)", "numpy (==1.14.5)", "packaging (==19.1)", "pluggy (==0.12.0)", "py (==1.8.0)", "pyparsing (==2.4.2)", "pytest (==5.1.2)", "python-dateutil (==2.8.0)", "repoze.lru (==0.7)", "requests (==2.22.0)", "s3transfer (==0.2.1)", "scikit-learn (==0.19.1)", "scipy (==1.1.0)", "six (==1.12.0)", "smart-open (==1.8.4)", "textstat (==0.4.1)", "urllib3 (==1.25.3)", "wcwidth (==0.1.7)", "zipp (==0.6.0)" ], "requires_python": ">=3.6", "summary": "A library of scikit compatible text transformers, that are ready to be integrated in an NLP pipeline for various classification tasks.", "version": "1.0.0" }, "last_serial": 5805715, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "8289d9f02cfdc56c33f7f90981ba2c7e", "sha256": "4e3a205ae78d7e5670fa28a6ec86734fb45ba17c67f361b9653257fa41b75e61" }, "downloads": -1, "filename": "nlpkit_ml-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "8289d9f02cfdc56c33f7f90981ba2c7e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 15212, "upload_time": "2019-09-07T10:59:04", "url": "https://files.pythonhosted.org/packages/bf/ca/25e7d00c385fb4768e327b3fe6930fdf8fcfd843e459e40aae8441acab35/nlpkit_ml-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "36856eb70e6de978ecd881c383c620c9", "sha256": "5fdbdcf61a65e755bf2623a9a660e20581fd9ab78d64806b3b53f8c636368729" }, "downloads": -1, "filename": "nlpkit-ml-0.0.1.tar.gz", "has_sig": false, "md5_digest": "36856eb70e6de978ecd881c383c620c9", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 3037, "upload_time": "2019-09-07T10:59:07", "url": "https://files.pythonhosted.org/packages/cb/0a/060f887d285225e98120be305897bba0e61b315422caeaadf91985e3b87b/nlpkit-ml-0.0.1.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "0d8b88cf5df25e902d1427d38f8aa6c6", "sha256": "8801012ccfb2e2becbf0ba6dc3459d99072fb8feb6e8ca9f8acd2b33c00bd6eb" }, "downloads": -1, "filename": "nlpkit_ml-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "0d8b88cf5df25e902d1427d38f8aa6c6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 36256, "upload_time": "2019-09-09T21:48:00", "url": "https://files.pythonhosted.org/packages/d1/1d/b9af15783576505339ff83db35da69884aeab7a4745cb4d30ce0e9a63ca1/nlpkit_ml-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "183be783d6f7f19439f2772d78af806f", "sha256": "bcf73a58a599b910351cb745426472995c8be33a65b3f6e9e2131164703975a1" }, "downloads": -1, "filename": "nlpkit-ml-1.0.0.tar.gz", "has_sig": false, "md5_digest": "183be783d6f7f19439f2772d78af806f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 14610, "upload_time": "2019-09-09T21:48:02", "url": "https://files.pythonhosted.org/packages/71/ac/ae32d441ee0a8d1aa49aeefa0c39fffb43ae6ddc3ab749a270e1ac1f8a6a/nlpkit-ml-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0d8b88cf5df25e902d1427d38f8aa6c6", "sha256": "8801012ccfb2e2becbf0ba6dc3459d99072fb8feb6e8ca9f8acd2b33c00bd6eb" }, "downloads": -1, "filename": "nlpkit_ml-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "0d8b88cf5df25e902d1427d38f8aa6c6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 36256, "upload_time": "2019-09-09T21:48:00", "url": "https://files.pythonhosted.org/packages/d1/1d/b9af15783576505339ff83db35da69884aeab7a4745cb4d30ce0e9a63ca1/nlpkit_ml-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "183be783d6f7f19439f2772d78af806f", "sha256": "bcf73a58a599b910351cb745426472995c8be33a65b3f6e9e2131164703975a1" }, "downloads": -1, "filename": "nlpkit-ml-1.0.0.tar.gz", "has_sig": false, "md5_digest": "183be783d6f7f19439f2772d78af806f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 14610, "upload_time": "2019-09-09T21:48:02", "url": "https://files.pythonhosted.org/packages/71/ac/ae32d441ee0a8d1aa49aeefa0c39fffb43ae6ddc3ab749a270e1ac1f8a6a/nlpkit-ml-1.0.0.tar.gz" } ] }