{ "info": { "author": "Fedor Kovalev", "author_email": "kovalevfm@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# SubTokenizer\n[![Build status](https://travis-ci.org/kovalevfm/SubTokenizer.svg?master)](https://travis-ci.org/kovalevfm)\n\nSubword tokenizer based on Google's code from tensor2tensor. In addition to the Google tokenizer's features, it supports tags and combined tokens.\n* Tags are tokens starting with `@`; they are not split into parts.\n* The no-break symbol `\u00ac` (`'\\xac'`) joins several words into one token.\n\nThe tokenizer performs Unicode normalization and escapes control characters. It is also possible to encode rare symbols so that the subword algorithm can split them into parts.\n\nOriginal Google subword tokenizer: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py\n\nBy default, before learning and tokenizing, SubTokenizer encodes all control characters and the `@` and `\u00ac` symbols. To use tags, run the encoding first, then add the tags, and then learn/tokenize with `encode_controls=False` or, in command-line mode, `--no_encode_controls`.\n\nInstall:\n```bash\n pip install subtokenizer\n```\n\nUsage:\n```bash\ncat text_file.txt | subtokenizer learn -o bpe.file -s 1000 -r reserved_tokens.txt\ncat text_file.txt | subtokenizer tokenize -s bpe.file > tokenized_file.txt\ncat tokenized_file.txt | subtokenizer detokenize -s bpe.file > text_file.txt\n```\nOr:\n```python\nfrom subtokenizer import SubTokenizer\n\ntokenizer = SubTokenizer.learn(words_count)\ntokenizer.save(subwords_filename)\n\ntokenizer = SubTokenizer.load(subwords_filename)\ntokens = tokenizer.tokenize(line)\nline = tokenizer.detokenize(tokens)\n\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kovalevfm/SubTokenizer", "keywords": "nlp tokenization", "license": "MIT", "maintainer": "", 
"maintainer_email": "", "name": "subtokenizer", "package_url": "https://pypi.org/project/subtokenizer/", "platform": "", "project_url": "https://pypi.org/project/subtokenizer/", "project_urls": { "Homepage": "https://github.com/kovalevfm/SubTokenizer" }, "release_url": "https://pypi.org/project/subtokenizer/0.0.19/", "requires_dist": null, "requires_python": ">=2.6, <4.0", "summary": "Subwords tokenizer for neural natural language processing", "version": "0.0.19" }, "last_serial": 5588379, "releases": { "0.0.10": [ { "comment_text": "", "digests": { "md5": "6a0fc0e751c7df72487c46ec41054712", "sha256": "f98e89b6d65bb4853e4bccaaec0640906b6e94fbe8abc5980c36ec5f49f9f0f7" }, "downloads": -1, "filename": "subtokenizer-0.0.10.tar.gz", "has_sig": false, "md5_digest": "6a0fc0e751c7df72487c46ec41054712", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9250, "upload_time": "2018-11-12T12:08:02", "url": "https://files.pythonhosted.org/packages/c3/57/92fa87f0238432e572afc0972e30ac48a8306764bdd7cbce49ba5ddf8235/subtokenizer-0.0.10.tar.gz" } ], "0.0.11": [ { "comment_text": "", "digests": { "md5": "db8c78b61a137cf9d628a9827ee6a6b3", "sha256": "6a63fea9f26a5ade0b978e1c583764620c1ebf59a739eac471ad6de8e745d7bf" }, "downloads": -1, "filename": "subtokenizer-0.0.11.tar.gz", "has_sig": false, "md5_digest": "db8c78b61a137cf9d628a9827ee6a6b3", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9270, "upload_time": "2019-02-12T16:44:59", "url": "https://files.pythonhosted.org/packages/83/40/e73e76bc01a54889542828f93ba778567d5854b7f5488a273fe9c3908172/subtokenizer-0.0.11.tar.gz" } ], "0.0.12": [ { "comment_text": "", "digests": { "md5": "ba8460694b46f7096d349fd8e29bdd91", "sha256": "4c4d7b2f21005d4ac09488aaace9fa43287b7b98344d6d4553e0bce0cac29060" }, "downloads": -1, "filename": "subtokenizer-0.0.12.tar.gz", "has_sig": false, "md5_digest": "ba8460694b46f7096d349fd8e29bdd91", "packagetype": "sdist", 
"python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9280, "upload_time": "2019-04-11T13:34:12", "url": "https://files.pythonhosted.org/packages/36/7e/582f177211967368eab3cf4c74342c1569cc067d1ff7ad1967fe7b8af147/subtokenizer-0.0.12.tar.gz" } ], "0.0.13": [ { "comment_text": "", "digests": { "md5": "54751c0b2a7b7c8e0b88bcf313c774c3", "sha256": "f4501195ec02f54af4e62b3e92a9ef02980b397511f4b6e5ab23ccc26edae419" }, "downloads": -1, "filename": "subtokenizer-0.0.13.tar.gz", "has_sig": false, "md5_digest": "54751c0b2a7b7c8e0b88bcf313c774c3", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9328, "upload_time": "2019-06-14T11:18:08", "url": "https://files.pythonhosted.org/packages/50/44/18670c0e2943ed38de9d3833d79b0c875a584db40d557bba5931e56f3cd3/subtokenizer-0.0.13.tar.gz" } ], "0.0.14": [ { "comment_text": "", "digests": { "md5": "b77f2a47c46fdf67e99ceecc0c9cb704", "sha256": "9b9b27819e740dc4ecb7357d70902dae2fcaacf8bbaa5fe842b435127a0d8cc0" }, "downloads": -1, "filename": "subtokenizer-0.0.14.tar.gz", "has_sig": false, "md5_digest": "b77f2a47c46fdf67e99ceecc0c9cb704", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9433, "upload_time": "2019-07-19T17:39:11", "url": "https://files.pythonhosted.org/packages/cb/d0/4dd96dad621a20c19b6f1468ab8a63a3bb3d4e0918d94da4e7dbe90d5ea0/subtokenizer-0.0.14.tar.gz" } ], "0.0.15": [ { "comment_text": "", "digests": { "md5": "8c16eabb835259bce30071b4fb7e5493", "sha256": "601a4064c05b392aa6d345a9f60aaec9ef676d7a534a42e369947844bf9561bd" }, "downloads": -1, "filename": "subtokenizer-0.0.15.tar.gz", "has_sig": false, "md5_digest": "8c16eabb835259bce30071b4fb7e5493", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9438, "upload_time": "2019-07-19T17:45:25", "url": 
"https://files.pythonhosted.org/packages/cf/85/dbfc14b6f54d173b914aeb7e9e1f30bd5884bd5f68d96200879be6c5d53c/subtokenizer-0.0.15.tar.gz" } ], "0.0.16": [ { "comment_text": "", "digests": { "md5": "ae1f6f3f83747736c807c5b3b7afd8d1", "sha256": "77df2d9ff8b2d4381fd0a1b681893b7e1b11e53302fd1c4325f90cea2b50de72" }, "downloads": -1, "filename": "subtokenizer-0.0.16.tar.gz", "has_sig": false, "md5_digest": "ae1f6f3f83747736c807c5b3b7afd8d1", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9435, "upload_time": "2019-07-19T18:01:24", "url": "https://files.pythonhosted.org/packages/70/e6/b60e9951a3539d6a1f9714f122546fcc2e10c5fefc90b6114d22aafd7a41/subtokenizer-0.0.16.tar.gz" } ], "0.0.17": [ { "comment_text": "", "digests": { "md5": "f5e0fc17da80be42540c576f7f3890c6", "sha256": "b1e58068e7506104d226a8f1529b54fd9062afbe2cf71e22c1809cd30a29e5bd" }, "downloads": -1, "filename": "subtokenizer-0.0.17.tar.gz", "has_sig": false, "md5_digest": "f5e0fc17da80be42540c576f7f3890c6", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9597, "upload_time": "2019-07-20T09:33:36", "url": "https://files.pythonhosted.org/packages/37/b3/fee9a85593aebe361992a3a552185c79954485b65b69028e067764cbb88d/subtokenizer-0.0.17.tar.gz" } ], "0.0.18": [ { "comment_text": "", "digests": { "md5": "885a998865962d5d04b2d7a53a2a696d", "sha256": "8268d7fc7c0ccbc572a57ced812a1e9c74ce14f4d8abc89113a3471cd5dc9e05" }, "downloads": -1, "filename": "subtokenizer-0.0.18.tar.gz", "has_sig": false, "md5_digest": "885a998865962d5d04b2d7a53a2a696d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9997, "upload_time": "2019-07-20T14:31:00", "url": "https://files.pythonhosted.org/packages/af/38/9a336a10a8b6186d85ed58d924b0007471affe906f4ab9689a870279183d/subtokenizer-0.0.18.tar.gz" } ], "0.0.19": [ { "comment_text": "", "digests": { "md5": "6e5df7aec70d366c89463d3952b63b3f", "sha256": 
"c1d5bf9f1aa897b08c86a4cacc8be6221f4240281481ffb961e13a067931abc2" }, "downloads": -1, "filename": "subtokenizer-0.0.19.tar.gz", "has_sig": false, "md5_digest": "6e5df7aec70d366c89463d3952b63b3f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 10279, "upload_time": "2019-07-26T10:32:19", "url": "https://files.pythonhosted.org/packages/8d/f9/bbc37011b825802362561fc8d978742592cafad9e366d4b656d0dbc94a5f/subtokenizer-0.0.19.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "3804c7970351b5a497d349f1d714673e", "sha256": "83be92e484850759cf08a9ec3cae7f5a91977349281ca24e79bf4d1b12558d25" }, "downloads": -1, "filename": "subtokenizer-0.0.4.tar.gz", "has_sig": false, "md5_digest": "3804c7970351b5a497d349f1d714673e", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9290, "upload_time": "2018-10-11T10:05:26", "url": "https://files.pythonhosted.org/packages/d5/8a/b93f50b053c0b8d0086de96296e95da2b387dfcc3fceca3e2aadb741e9c8/subtokenizer-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "ee778d849421082112c7591502ce589a", "sha256": "af57192c92226862293e791cece7f3206dc74f8a404e58a85ba7520a4646c77b" }, "downloads": -1, "filename": "subtokenizer-0.0.5.tar.gz", "has_sig": false, "md5_digest": "ee778d849421082112c7591502ce589a", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9331, "upload_time": "2018-10-29T11:05:46", "url": "https://files.pythonhosted.org/packages/3f/d7/212cc814950e154c024f640c140b34d3c7c001c6c2468eff1b40a310400f/subtokenizer-0.0.5.tar.gz" } ], "0.0.6": [ { "comment_text": "", "digests": { "md5": "58d365db8fd2ce8983b1cec2c6fd4a71", "sha256": "490706f0572e3abe731a29e5b286c70609bf423b3768f86474b7cf21091ac6c1" }, "downloads": -1, "filename": "subtokenizer-0.0.6.tar.gz", "has_sig": false, "md5_digest": "58d365db8fd2ce8983b1cec2c6fd4a71", "packagetype": "sdist", "python_version": "source", 
"requires_python": ">=2.6, <4.0", "size": 9348, "upload_time": "2018-10-29T11:15:36", "url": "https://files.pythonhosted.org/packages/4d/2c/502930c61b874d95300ba439944d40aab783570b72be85a10b7daf9d4aec/subtokenizer-0.0.6.tar.gz" } ], "0.0.7": [ { "comment_text": "", "digests": { "md5": "2376e78d445c9d37c65834eab0f60c91", "sha256": "6761d4d4880688bd947c90bbc093074d6808a67a4b10637350600464118a0ca8" }, "downloads": -1, "filename": "subtokenizer-0.0.7.tar.gz", "has_sig": false, "md5_digest": "2376e78d445c9d37c65834eab0f60c91", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9419, "upload_time": "2018-10-29T11:39:10", "url": "https://files.pythonhosted.org/packages/7f/12/028548296c80b9d6c7efe72d6e5921c03a157676742efcc5cad80e06c18c/subtokenizer-0.0.7.tar.gz" } ], "0.0.8": [ { "comment_text": "", "digests": { "md5": "d47f17be7821993056d6ece4f156e845", "sha256": "caade9362bed4815b9a9e407c225a90fb7692670d81da2f3cc8d63255047fc7e" }, "downloads": -1, "filename": "subtokenizer-0.0.8.tar.gz", "has_sig": false, "md5_digest": "d47f17be7821993056d6ece4f156e845", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9436, "upload_time": "2018-10-31T11:44:47", "url": "https://files.pythonhosted.org/packages/a0/af/2f453f2795e4d740f4dce8db4316bc6b688568f1883f05e7c77f99bafa5a/subtokenizer-0.0.8.tar.gz" } ], "0.0.9": [ { "comment_text": "", "digests": { "md5": "bafdd045464607b9ed8e2191c67d2871", "sha256": "300ee90cae37e2df00cc2c6387e773b774faebd31fd7275b5ed9cfe5348d715f" }, "downloads": -1, "filename": "subtokenizer-0.0.9.tar.gz", "has_sig": false, "md5_digest": "bafdd045464607b9ed8e2191c67d2871", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 9211, "upload_time": "2018-11-12T08:44:57", "url": "https://files.pythonhosted.org/packages/48/4c/e420706a71a6b2c2292c775efd6644f25f4627734459c7c9902eaef32a9f/subtokenizer-0.0.9.tar.gz" } ] }, "urls": [ { 
"comment_text": "", "digests": { "md5": "6e5df7aec70d366c89463d3952b63b3f", "sha256": "c1d5bf9f1aa897b08c86a4cacc8be6221f4240281481ffb961e13a067931abc2" }, "downloads": -1, "filename": "subtokenizer-0.0.19.tar.gz", "has_sig": false, "md5_digest": "6e5df7aec70d366c89463d3952b63b3f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, <4.0", "size": 10279, "upload_time": "2019-07-26T10:32:19", "url": "https://files.pythonhosted.org/packages/8d/f9/bbc37011b825802362561fc8d978742592cafad9e366d4b656d0dbc94a5f/subtokenizer-0.0.19.tar.gz" } ] }