{ "info": { "author": "Joseph Sefara", "author_email": "sefaratj@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Text Processing :: Linguistic" ], "description": "# [TextAugment: Improving short text classification through global augmentation methods](https://arxiv.org/abs/1907.03752)\n\nTextAugment is a Python 3 library for augmenting text in natural language processing applications. TextAugment stands on the giant shoulders of [NLTK](https://www.nltk.org/), [Gensim](https://radimrehurek.com/gensim/), and [TextBlob](https://textblob.readthedocs.io/) and plays nicely with them.\n\n## Citation Paper\n\n**[Improving short text classification through global augmentation methods](https://arxiv.org/abs/1907.03752)**, published at [MLDM 2019](http://mldm.de)\n\n### Requirements\n\n* Python 3\n\nThe following software packages are dependencies and will be installed automatically.\n\n```shell\n$ pip install numpy nltk gensim textblob googletrans\n```\nThe following code downloads the NLTK corpus for [wordnet](http://www.nltk.org/howto/wordnet.html).\n```python\nimport nltk\nnltk.download('wordnet')\n```\nThe following code downloads the [NLTK tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html). This tokenizer divides a text into a list of sentences using an unsupervised algorithm that builds a model for abbreviations, collocations, and words that start sentences.\n```python\nnltk.download('punkt')\n```\nThe following code downloads the default [NLTK part-of-speech tagger](https://www.nltk.org/_modules/nltk/tag.html) model. 
A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.\n```python\nnltk.download('averaged_perceptron_tagger')\n```\nUse Gensim to load a pre-trained word2vec model, such as [Google News from Google Drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).\n```python\nimport gensim\nmodel = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)\n```\nOr train one from scratch using your own data or one of the following public datasets:\n\n- [Text8 Wiki](http://mattmahoney.net/dc/enwik9.zip)\n\n- [Dataset from \"One Billion Word Language Modeling Benchmark\"](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)\n\n### Installation\n\nInstall from pip [Recommended]\n```sh\n$ pip install textaugment\n```\nor install the latest release from GitHub\n```sh\n$ pip install git+git@github.com:dsfsi/textaugment.git\n```\n\nInstall from source\n```sh\n$ git clone git@github.com:dsfsi/textaugment.git\n$ cd textaugment\n$ python setup.py install\n```\n\n### How to use\n\nThree types of augmentation are available:\n\n- word2vec\n```python\nfrom textaugment import Word2vec\n```\n- wordnet\n```python\nfrom textaugment import Wordnet\n```\n- translate (this requires internet access)\n```python\nfrom textaugment import Translate\n```\n#### Word2vec-based augmentation\n**Basic example**\n```python\n>>> from textaugment import Word2vec\n>>> t = Word2vec(model='path/to/gensim/model')  # a path to a saved model, or a loaded gensim model itself\n>>> t.augment('The stories are good')\nThe films are good\n```\n**Advanced example**\n\n```python\n>>> runs = 1 # number of augmentation runs. 1 by default.\n>>> v = False # verbose mode replaces all the words; when enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)\n>>> p = 0.5 # the probability of success of an individual trial (0.1<p<1)
>>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # a path to a saved model, or a loaded gensim model itself\n>>> t.augment('The stories are good')\nThe movies are excellent\n```\n#### WordNet-based augmentation\n**Basic example**\n```python\n>>> import nltk\n>>> nltk.download('punkt')\n>>> nltk.download('wordnet')\n>>> from textaugment import Wordnet\n>>> t = Wordnet()\n>>> t.augment('In the afternoon, John is going to town')\nIn the afternoon, John is walking to town\n```\n**Advanced example**\n\n```python\n>>> v = True # enable verb augmentation. True by default.\n>>> n = False # enable noun augmentation. False by default.\n>>> runs = 1 # number of times to augment a sentence. 1 by default.\n>>> p = 0.5 # the probability of success of an individual trial (0.1<p<1)
>>> t = Wordnet(v=False, n=True, p=0.5)\n>>> t.augment('In the afternoon, John is going to town')\nIn the afternoon, Joseph is going to town.\n```\n#### RTT-based augmentation\nRTT (round-trip translation) translates a sentence into a target language and back into the source language.\n**Example**\n```python\n>>> src = \"en\" # source language of the sentence\n>>> to = \"fr\" # target language\n>>> from textaugment import Translate\n>>> t = Translate(src=\"en\", to=\"fr\")\n>>> t.augment('In the afternoon, John is going to town')\nIn the afternoon John goes to town\n```\n## Built with \u2764 on\n* [Python](http://python.org/)\n\n## Authors\n* [Joseph Sefara](https://za.linkedin.com/in/josephsefara) (http://www.speechtech.co.za)\n* [Vukosi Marivate](http://www.vima.co.za)\n\n## Acknowledgements\nCite this [paper](https://arxiv.org/abs/1907.03752) when using this library.\n\n## Licence\nMIT licensed. See the bundled [LICENCE](LICENCE) file for more details.\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dsfsi/textaugment", "keywords": "text augmentation,python,natural language processing,nlp", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "textaugment", "package_url": "https://pypi.org/project/textaugment/", "platform": "", "project_url": "https://pypi.org/project/textaugment/", "project_urls": { "Homepage": "https://github.com/dsfsi/textaugment" }, "release_url": "https://pypi.org/project/textaugment/1.1/", "requires_dist": [ "nltk", "gensim", "textblob", "numpy", "googletrans" ], "requires_python": "", "summary": "A library for augmenting text for natural language processing applications.", "version": "1.1" }, "last_serial": 5535452, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "9b1372cb4cafdf67948b7d0f2330b7c1", "sha256": "c210ae4b50764cb17ddf91fedb197fba988793731c9f4445448ef3a77dd70957" }, "downloads": -1, "filename": "textaugment-1.0-py3-none-any.whl", "has_sig": 
false, "md5_digest": "9b1372cb4cafdf67948b7d0f2330b7c1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 10258, "upload_time": "2019-07-15T12:38:26", "url": "https://files.pythonhosted.org/packages/ef/88/372549739a6dfa4fe9f0b0e6247c7e3b861a6c3c00504f6e96ddec9e25b5/textaugment-1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b6aed57fe8bad23d008587110db93074", "sha256": "44316631883effe6c3a76d19b5e908c1afaf637ed8696d8cbc1a0d2037c72e78" }, "downloads": -1, "filename": "textaugment-1.0.tar.gz", "has_sig": false, "md5_digest": "b6aed57fe8bad23d008587110db93074", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10287, "upload_time": "2019-07-15T12:38:28", "url": "https://files.pythonhosted.org/packages/9e/fc/557b6a1ec8fb5095b1d9f1f3c3fffeb59404b84ef66ea6d5398b79bba242/textaugment-1.0.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "f6d4ce092f907799dcfe6419a264e6cb", "sha256": "65c2d014dab8f4457f5998c0b2f3e04a7ae0e717cb81d0ab5f0f59b760133e1b" }, "downloads": -1, "filename": "textaugment-1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "f6d4ce092f907799dcfe6419a264e6cb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 11116, "upload_time": "2019-07-15T15:02:25", "url": "https://files.pythonhosted.org/packages/8d/a0/c48647d04668f3b7cec8e9504058a959709251f2cc5dd4a8df4d62b2a638/textaugment-1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "7b5ef3c9efd1a78259788015ffb455c7", "sha256": "6d0ecca10cafc6e73d3f0b3b78beeb62b4db8f1527f026feeaa8e19ca986f7c6" }, "downloads": -1, "filename": "textaugment-1.1.tar.gz", "has_sig": false, "md5_digest": "7b5ef3c9efd1a78259788015ffb455c7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10368, "upload_time": "2019-07-15T15:02:26", "url": 
"https://files.pythonhosted.org/packages/7e/42/1f7b29274fed9242080fcb31dc52d5b67cf9578370fd8d783959f7cfbd4e/textaugment-1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f6d4ce092f907799dcfe6419a264e6cb", "sha256": "65c2d014dab8f4457f5998c0b2f3e04a7ae0e717cb81d0ab5f0f59b760133e1b" }, "downloads": -1, "filename": "textaugment-1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "f6d4ce092f907799dcfe6419a264e6cb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 11116, "upload_time": "2019-07-15T15:02:25", "url": "https://files.pythonhosted.org/packages/8d/a0/c48647d04668f3b7cec8e9504058a959709251f2cc5dd4a8df4d62b2a638/textaugment-1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "7b5ef3c9efd1a78259788015ffb455c7", "sha256": "6d0ecca10cafc6e73d3f0b3b78beeb62b4db8f1527f026feeaa8e19ca986f7c6" }, "downloads": -1, "filename": "textaugment-1.1.tar.gz", "has_sig": false, "md5_digest": "7b5ef3c9efd1a78259788015ffb455c7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10368, "upload_time": "2019-07-15T15:02:26", "url": "https://files.pythonhosted.org/packages/7e/42/1f7b29274fed9242080fcb31dc52d5b67cf9578370fd8d783959f7cfbd4e/textaugment-1.1.tar.gz" } ] }