{ "info": { "author": "Clementine", "author_email": "iclementine@outlook.com", "bugtrack_url": null, "classifiers": [], "description": "pytext\n+++++++++\n\nThis repository consists of:\n\n* `pytext.data <#data>`_: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)\n* `pytext.datasets <#datasets>`_: Pre-built loaders for common NLP datasets\n\nIt is a fork of torchtext, but use numpy ndarray for dataset instead of torch.Tensor or Variable, so as to make it a more generic toolbox for NLP users.\n\nInstallation\n============\n\n\nMake sure you have Python 2.7 or 3.5+ and PyTorch 0.4.0 or newer. You can then install torchtext using pip::\n\n pip install pytexttool\n\n\nOptional requirements\n---------------------\n\nIf you want to use English tokenizer from `SpaCy `_, you need to install SpaCy and download its English model::\n\n pip install spacy\n python -m spacy download en\n\nAlternatively, you might want to use Moses tokenizer from `NLTK `_. You have to install NLTK and download the data needed::\n\n pip install nltk\n python -m nltk.downloader perluniprops nonbreaking_prefixes\n\nData\n====\n\nThe data module provides the following:\n\n* Ability to describe declaratively how to load a custom NLP dataset that's in a \"normal\" format:\n\n .. code-block:: python\n\n >>> pos = data.TabularDataset(\n ... path='data/pos/pos_wsj_train.tsv', format='tsv',\n ... fields=[('text', data.Field()),\n ... ('labels', data.Field())])\n ...\n >>> sentiment = data.TabularDataset(\n ... path='data/sentiment/train.json', format='json',\n ... fields={'sentence_tokenized': ('text', data.Field(sequential=True)),\n ... 'sentiment_gold': ('labels', data.Field(sequential=False))})\n\n* Ability to define a preprocessing pipeline:\n\n .. code-block:: python\n\n >>> src = data.Field(tokenize=my_custom_tokenizer)\n >>> trg = data.Field(tokenize=my_custom_tokenizer)\n >>> mt_train = datasets.TranslationDataset(\n ... path='data/mt/wmt16-ende.train', exts=('.en', '.de'),\n ... fields=(src, trg))\n\n* Batching, padding, and numericalizing (including building a vocabulary object):\n\n .. code-block:: python\n\n >>> # continuing from above\n >>> mt_dev = datasets.TranslationDataset(\n ... path='data/mt/newstest2014', exts=('.en', '.de'),\n ... fields=(src, trg))\n >>> src.build_vocab(mt_train, max_size=80000)\n >>> trg.build_vocab(mt_train, max_size=40000)\n >>> # mt_dev shares the fields, so it shares their vocab objects\n >>>\n >>> train_iter = data.BucketIterator(\n ... dataset=mt_train, batch_size=32,\n ... sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))\n >>> # usage\n >>> next(iter(train_iter))\n \n\n* Wrapper for dataset splits (train, validation, test):\n\n .. code-block:: python\n\n >>> TEXT = data.Field()\n >>> LABELS = data.Field()\n >>>\n >>> train, val, test = data.TabularDataset.splits(\n ... path='/data/pos_wsj/pos_wsj', train='_train.tsv',\n ... validation='_dev.tsv', test='_test.tsv', format='tsv',\n ... fields=[('text', TEXT), ('labels', LABELS)])\n >>>\n >>> train_iter, val_iter, test_iter = data.BucketIterator.splits(\n ... (train, val, test), batch_sizes=(16, 256, 256),\n >>> sort_key=lambda x: len(x.text), device=0)\n >>>\n >>> TEXT.build_vocab(train)\n >>> LABELS.build_vocab(train)\n\nDatasets\n========\n\nThe datasets module currently contains:\n\n* Sentiment analysis: SST and IMDb\n* Question classification: TREC\n* Entailment: SNLI\n* Language modeling: abstract class + WikiText-2\n* Machine translation: abstract class + Multi30k, IWSLT, WMT14\n* Sequence tagging (e.g. POS/NER): abstract class + UDPOS\n\nOthers are planned or a work in progress:\n\n* Question answering: SQuAD\n\nSee the ``test`` directory for examples of dataset usage.\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/iclementine/text", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "pytexttool", "package_url": "https://pypi.org/project/pytexttool/", "platform": "", "project_url": "https://pypi.org/project/pytexttool/", "project_urls": { "Homepage": "https://github.com/iclementine/text" }, "release_url": "https://pypi.org/project/pytexttool/0.3.1/", "requires_dist": [ "requests", "tqdm" ], "requires_python": "", "summary": "Text utilities and datasets for generic deep learning, Fork from torchtext but uses numpy to store datasets for more generic use.", "version": "0.3.1" }, "last_serial": 4206543, "releases": { "0.3.1": [ { "comment_text": "", "digests": { "md5": "46f272a1c6d492eae96273f1384f4b73", "sha256": "abd399bb54948264fb9416048ea0c0fdf967dee98513e82e22862dbb6e3986cd" }, "downloads": -1, "filename": "pytexttool-0.3.1-py2-none-any.whl", "has_sig": false, "md5_digest": "46f272a1c6d492eae96273f1384f4b73", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 60524, "upload_time": "2018-08-25T13:23:49", "url": "https://files.pythonhosted.org/packages/86/1e/e42c90663c9840425a6b640d4808c0507b137b1dcb7c3d2a73f362eb627e/pytexttool-0.3.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1300de3e55bf1f90168f342989b83bf8", "sha256": "5ba8bef67e8698b04374fed400fd4f64b6ce460737bf4228d04a11c3b2ec2c14" }, "downloads": -1, "filename": "pytexttool-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "1300de3e55bf1f90168f342989b83bf8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 60518, "upload_time": "2018-08-25T13:23:51", "url": "https://files.pythonhosted.org/packages/84/e2/c7ea7b4b65acd7b63917699cf2a5a9dc6d6430148e0d799cf41b81dc9969/pytexttool-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "36b1174cbf7129ff4b210cc4f2ebd6ee", "sha256": "a0718c0a6ac5a63266ebaff5780eba63e5b01d137b8f740adb3aab593a853915" }, "downloads": -1, "filename": "pytexttool-0.3.1.tar.gz", "has_sig": false, "md5_digest": "36b1174cbf7129ff4b210cc4f2ebd6ee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 47953, "upload_time": "2018-08-25T13:23:53", "url": "https://files.pythonhosted.org/packages/17/a7/3dae3ef0f428183ccca084fd7a73747c50b6222be0fdbab4a17542a8d416/pytexttool-0.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "46f272a1c6d492eae96273f1384f4b73", "sha256": "abd399bb54948264fb9416048ea0c0fdf967dee98513e82e22862dbb6e3986cd" }, "downloads": -1, "filename": "pytexttool-0.3.1-py2-none-any.whl", "has_sig": false, "md5_digest": "46f272a1c6d492eae96273f1384f4b73", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 60524, "upload_time": "2018-08-25T13:23:49", "url": "https://files.pythonhosted.org/packages/86/1e/e42c90663c9840425a6b640d4808c0507b137b1dcb7c3d2a73f362eb627e/pytexttool-0.3.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1300de3e55bf1f90168f342989b83bf8", "sha256": "5ba8bef67e8698b04374fed400fd4f64b6ce460737bf4228d04a11c3b2ec2c14" }, "downloads": -1, "filename": "pytexttool-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "1300de3e55bf1f90168f342989b83bf8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 60518, "upload_time": "2018-08-25T13:23:51", "url": "https://files.pythonhosted.org/packages/84/e2/c7ea7b4b65acd7b63917699cf2a5a9dc6d6430148e0d799cf41b81dc9969/pytexttool-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "36b1174cbf7129ff4b210cc4f2ebd6ee", "sha256": "a0718c0a6ac5a63266ebaff5780eba63e5b01d137b8f740adb3aab593a853915" }, "downloads": -1, "filename": "pytexttool-0.3.1.tar.gz", "has_sig": false, "md5_digest": "36b1174cbf7129ff4b210cc4f2ebd6ee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 47953, "upload_time": "2018-08-25T13:23:53", "url": "https://files.pythonhosted.org/packages/17/a7/3dae3ef0f428183ccca084fd7a73747c50b6222be0fdbab4a17542a8d416/pytexttool-0.3.1.tar.gz" } ] }