{ "info": { "author": "Pascal van Kooten", "author_email": "kootenpv@gmail.com", "bugtrack_url": null, "classifiers": [ "Environment :: Console", "Intended Audience :: Customer Service", "Intended Audience :: Developers", "Intended Audience :: System Administrators", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft", "Operating System :: POSIX", "Operating System :: Unix", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Software Development", "Topic :: Software Development :: Build Tools", "Topic :: Software Development :: Debuggers", "Topic :: Software Development :: Libraries", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: System :: Software Distribution", "Topic :: System :: Systems Administration", "Topic :: Utilities" ], "description": "## tok\n\n[![PyPI](https://img.shields.io/pypi/v/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/)\n[![PyPI](https://img.shields.io/pypi/pyversions/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/)\n\nFast and most complete/customizable tokenizer in Python.\n\nIt is roughly 25x faster than spacy's and nltk's regex based tokenizers.\n\nUsing the aho-corasick algorithm makes it a novelty and allows it to be both explainable and fast in how it will split.\n\nThe heavy lifting is done by [textsearch](https://github.com/kootenpv/textsearch) and [pyahocorasick](https://github.com/WojciechMula/pyahocorasick), allowing this to be written in only ~200 lines of code.\n\nContrary to regex-based approaches, it will go over each character in a text only once. Read [below](#how-it-works) about how this works.\n\n### Installation\n\n pip install tok\n\n### Usage\n\nBy default it handles contractions, http, (float) numbers and currencies.\n\n```python\nfrom tok import word_tokenize\nword_tokenize(\"I wouldn't do that.... would you?\")\n['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']\n```\n\nOr configure it yourself:\n\n```python\nfrom tok import Tokenizer\ntokenizer = Tokenizer(protected_words=[\"some.thing\"]) # still using the defaults\ntokenizer.word_tokenize(\"I want to protect some.thing\")\n['I', 'want', 'to', 'protect', 'some.thing']\n```\n\nSplit by sentences:\n\n```python\nfrom tok import sent_tokenize\nsent_tokenize(\"I wouldn't do that.... would you?\")\n[['I', 'would', 'not', 'do', 'that', '...'], ['would', 'you', '?']]\n```\n\nfor more options check the documentation of the `Tokenizer`.\n\n### Further customization\n\nGiven:\n\n```python\nfrom tok import Tokenizer\nt = Tokenizer(protected_words=[\"some.thing\"]) # still using the defaults\n```\n\nYou can add your own ideas to the tokenizer by using:\n\n- `t.keep(x, reason)`: Whenever it finds x, it will not add whitespace. Prevents direct tokenization.\n- `t.split(x, reason)`: Whenever it finds x, it will surround it by whitespace, thus creating a token.\n- `t.drop(x, reason)`: Whenever it finds x, it will remove it but add a split.\n- `t.strip(x, reason)`: Whenever it finds x, it will remove it without splitting.\n\n```python\nt.drop(\"bla\", \"bla is not needed\")\nt.word_tokenize(\"Please remove bla, thank you\")\n['Please', 'remove', ',', 'thank', 'you']\n```\n\n### Explainable\n\nExplain what happened:\n\n```python\nt.explain(\"bla\")\n[{'from': 'bla', 'to': ' ', 'explanation': 'bla is not needed'}]\n```\n\nSee everything in there (will help you understand how it works):\n\n```python\nt.explain_dict\n```\n\n### How it works\n\nIt will always only keep the longest match. By introducing a space in your tokens, it will make it be split.\n\nIf you consider how the tokenization of `.` works, see here:\n\n- When it finds a ` A.` it will make it ` A.` (single letter abbreviations)\n- When it finds a `.0` it will make it `.0` (numbers)\n- When it finds a `.`, it will make it ` . ` (thus making a split)\n\nIf you want to make sure something including a dot stays, you can use for example:\n\n t.keep(\"cool.\")\n\n### Contributing\n\nIt would be greatly appreciated if you want to contribute to this library.\n\nIt would also be great to add [contractions](https://github.com/kootenpv/contractions) for other languages.\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kootenpv/tok", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "tok", "package_url": "https://pypi.org/project/tok/", "platform": "any", "project_url": "https://pypi.org/project/tok/", "project_urls": { "Homepage": "https://github.com/kootenpv/tok" }, "release_url": "https://pypi.org/project/tok/0.1.14/", "requires_dist": null, "requires_python": "", "summary": "Fast and customizable tokenizer", "version": "0.1.14" }, "last_serial": 5508509, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "04d8c87fa35b2bab812efc41f47adc87", "sha256": "3ec602fa3914c899312076c9d65a353347e5d86e2a10cc3da1ad9e7553cbf3ba" }, "downloads": -1, "filename": "tok-0.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "04d8c87fa35b2bab812efc41f47adc87", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7150, "upload_time": "2019-07-04T18:45:40", "url": "https://files.pythonhosted.org/packages/39/f0/54000e66ec52d42642c0121aa6edc76d353ea1df27d322cb203720c42dbb/tok-0.0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "df8ba5d8c06cfbd8378a0eea2daac304", "sha256": "a9467486a0099dc5c4d2602fe2a404aa12af0993735f51c4eaa49038f2c7c941" }, "downloads": -1, "filename": "tok-0.0.1.tar.gz", "has_sig": false, "md5_digest": "df8ba5d8c06cfbd8378a0eea2daac304", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4878, "upload_time": "2019-07-04T18:45:36", "url": "https://files.pythonhosted.org/packages/09/c4/5b173be87aa13a98b1386ab75730a138aaebc798e51b7cc80fabec31b1e9/tok-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "3e45fb5e3dbf58a79f0d28a6c0a35f05", "sha256": "8c58a6669447d39846842ddda9e02551f4faafe0658bfa2dadf5d74dcaceaca2" }, "downloads": -1, "filename": "tok-0.0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "3e45fb5e3dbf58a79f0d28a6c0a35f05", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7174, "upload_time": "2019-07-04T18:49:52", "url": "https://files.pythonhosted.org/packages/13/fc/52c686f862139288f59b8398cc86b207bac39d0e82e30f3f0d8f9f945613/tok-0.0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "96af9666dafc361cef4e9fce3ac7ce1d", "sha256": "e0d70f5a5e7a80053c5c2c8a7cae5aef8ce0eae9a1b5085a70041a610e9c5c4c" }, "downloads": -1, "filename": "tok-0.0.2.tar.gz", "has_sig": false, "md5_digest": "96af9666dafc361cef4e9fce3ac7ce1d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4906, "upload_time": "2019-07-04T18:49:50", "url": "https://files.pythonhosted.org/packages/0e/98/c28fa7619aedbe29f9b4f496ae6da8b21911f2f1c73a71d25c83285a262a/tok-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "e61222ccc1be0a9497c0edae91e4ea92", "sha256": "4ec1c5b87c70baa0befdd4c2a0d39c52951ec99ab047617a9d3f4329b5c0141c" }, "downloads": -1, "filename": "tok-0.0.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "e61222ccc1be0a9497c0edae91e4ea92", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7181, "upload_time": "2019-07-04T18:51:47", "url": "https://files.pythonhosted.org/packages/5d/d3/a0f853e7ef5c3b7a070b3e77e4af9b7f70023a821cad8db765fd3ff94127/tok-0.0.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5107ca355abfc444e46609ce80ebb747", "sha256": "66ab3a3ad8003e266540f90d5464f44a38a63a50dfbc7e3062e89d970a65f557" }, "downloads": -1, "filename": "tok-0.0.3.tar.gz", "has_sig": false, "md5_digest": "5107ca355abfc444e46609ce80ebb747", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4921, "upload_time": "2019-07-04T18:51:39", "url": "https://files.pythonhosted.org/packages/8e/34/45c1b0a4028e4ea8c54561e18ebcf96045b3cd1df998ed6ebb0fba0305ca/tok-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "b7667ccbe4cf0f641688c688e2b74dd3", "sha256": "80f69230df5312b521e841cf6d5773bf3303310c91056dcc4af9ebbf9d5099f4" }, "downloads": -1, "filename": "tok-0.0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b7667ccbe4cf0f641688c688e2b74dd3", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7317, "upload_time": "2019-07-04T19:00:26", "url": "https://files.pythonhosted.org/packages/4f/91/696e615b83116b20a82cf7e27adb6e8e518f5fec406344d271f31d491aad/tok-0.0.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "895ca795f74e83f5b5c8cc69e7c1c849", "sha256": "7f31c5e11affa9398296f1e80059b7ef5c24ab3d523c75a4c77c9d2917b6456d" }, "downloads": -1, "filename": "tok-0.0.4.tar.gz", "has_sig": false, "md5_digest": "895ca795f74e83f5b5c8cc69e7c1c849", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5070, "upload_time": "2019-07-04T19:00:23", "url": "https://files.pythonhosted.org/packages/91/9f/2e8294c6ce69db27f6562729cd4378262e2f99ca05b06d2d6463c0db70dc/tok-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "12c56c36a7543ebde78f714110fbfcc6", "sha256": "5b344d235f3a846b6184745d6554faca56ede29561ab9f744194a244b2180eb1" }, "downloads": -1, "filename": "tok-0.0.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "12c56c36a7543ebde78f714110fbfcc6", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7752, "upload_time": "2019-07-04T19:32:55", "url": "https://files.pythonhosted.org/packages/1a/bd/6e8b45b1b75cd079184bf3d187fe58641c5c7f031531b1f7a53a0104c272/tok-0.0.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b6f26e593029486a840041873bb8de7a", "sha256": "d1c94f070834f68a5a2d0a3f2c7fd8c2fd446a2db6a072c33fec911b41b7e79d" }, "downloads": -1, "filename": "tok-0.0.5.tar.gz", "has_sig": false, "md5_digest": "b6f26e593029486a840041873bb8de7a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5558, "upload_time": "2019-07-04T19:32:53", "url": "https://files.pythonhosted.org/packages/70/ec/a7ce5c691a0ae94e81533e198e169f9ebb89b3540c6c7d7583058872a375/tok-0.0.5.tar.gz" } ], "0.0.8": [ { "comment_text": "", "digests": { "md5": "b1592eb223461b6446c7eb5fcc7d4fd8", "sha256": "d0457ee1d92ee32ae8130cccc477d6153900bdbb90af203b4782fb2a3a247e99" }, "downloads": -1, "filename": "tok-0.0.8-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b1592eb223461b6446c7eb5fcc7d4fd8", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7774, "upload_time": "2019-07-04T19:41:58", "url": "https://files.pythonhosted.org/packages/45/19/253caf7a2974ebe9ca758b7430b49df3ece421bd5b0a7e607dff62c99f30/tok-0.0.8-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f85d4ee9c2de1592b842bb1c2f89d84f", "sha256": "ec068157cfc8aeaf91f8cc0c941af1ceb2322624838651c7a9a8b502e85cdf9c" }, "downloads": -1, "filename": "tok-0.0.8.tar.gz", "has_sig": false, "md5_digest": "f85d4ee9c2de1592b842bb1c2f89d84f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5572, "upload_time": "2019-07-04T19:41:56", "url": "https://files.pythonhosted.org/packages/a5/ac/7246231417785cbeb7a3ae4a62677e9cf63c1eea6ed5e33ce5642f4c5602/tok-0.0.8.tar.gz" } ], "0.0.9": [ { "comment_text": "", "digests": { "md5": "02ba78e6a480b7d7b757efa9da67f81f", "sha256": "ed49c78323dd41813e7f575302fa08f995f3bf9a39915e38e5777d6bb75b239f" }, "downloads": -1, "filename": "tok-0.0.9-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "02ba78e6a480b7d7b757efa9da67f81f", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7732, "upload_time": "2019-07-04T19:46:05", "url": "https://files.pythonhosted.org/packages/ab/b9/46c7f2e73af71aef33f58b947c4f6ea39e271a21c540da2ae8f135e0ff62/tok-0.0.9-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4614e6c9b0864606de40800d0bd9b214", "sha256": "db960aae5ae52fd3a5c27b904b09f63f263b97ad622ecb51ffbd643659cd9b1c" }, "downloads": -1, "filename": "tok-0.0.9.tar.gz", "has_sig": false, "md5_digest": "4614e6c9b0864606de40800d0bd9b214", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5528, "upload_time": "2019-07-04T19:46:03", "url": "https://files.pythonhosted.org/packages/3c/e0/4a011a019db95937a589b3fc54f7dd22848b31a3ee3adaa3b8c6c6661226/tok-0.0.9.tar.gz" } ], "0.1.13": [ { "comment_text": "", "digests": { "md5": "498f350ee117d47f5589a97ad079801a", "sha256": "31c308cb15e9f92b31d6d53b5e499e26063da77896849fa990dc5debfe0dcba8" }, "downloads": -1, "filename": "tok-0.1.13-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "498f350ee117d47f5589a97ad079801a", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7757, "upload_time": "2019-07-09T18:44:15", "url": "https://files.pythonhosted.org/packages/57/dd/37b39f8ade4446008b143ee1c412cdd0dd1cb325f8888dbf679f1e07ab4b/tok-0.1.13-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ed7513c149280073b354a79863215c34", "sha256": "2cf655fd6a25525fb1b19083d26721f2cf615c843649de76b78631b433e6627d" }, "downloads": -1, "filename": "tok-0.1.13.tar.gz", "has_sig": false, "md5_digest": "ed7513c149280073b354a79863215c34", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5546, "upload_time": "2019-07-09T18:44:13", "url": "https://files.pythonhosted.org/packages/bb/95/07f8b27660d78024adef78de21c1add055786e0d1c700937f6372a83af59/tok-0.1.13.tar.gz" } ], "0.1.14": [ { "comment_text": "", "digests": { "md5": "dbc3d314dd0bc7451734a29e88b1ed4e", "sha256": "8c3454e8968871808e4dd6d7d86371b11d7d5dea1b92b260016392750721165f" }, "downloads": -1, "filename": "tok-0.1.14-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "dbc3d314dd0bc7451734a29e88b1ed4e", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7760, "upload_time": "2019-07-09T18:48:29", "url": "https://files.pythonhosted.org/packages/a3/96/878cb1996aa90ff3f714205e5ba8d4a891a57463ad54594e553e39fd2691/tok-0.1.14-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e8b91768944ea55e6426380623170c5a", "sha256": "6c48d5b77c8e4e9e6e8413827d6f0ba8c058ce5578198b2ee68901a07fbc2e57" }, "downloads": -1, "filename": "tok-0.1.14.tar.gz", "has_sig": false, "md5_digest": "e8b91768944ea55e6426380623170c5a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5556, "upload_time": "2019-07-09T18:48:26", "url": "https://files.pythonhosted.org/packages/fb/e8/6798c017485aa2e0713b9f4fdae679e91940b3927839b8d2866820c1b994/tok-0.1.14.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "dbc3d314dd0bc7451734a29e88b1ed4e", "sha256": "8c3454e8968871808e4dd6d7d86371b11d7d5dea1b92b260016392750721165f" }, "downloads": -1, "filename": "tok-0.1.14-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "dbc3d314dd0bc7451734a29e88b1ed4e", "packagetype": "bdist_wheel", "python_version": "3.7", "requires_python": null, "size": 7760, "upload_time": "2019-07-09T18:48:29", "url": "https://files.pythonhosted.org/packages/a3/96/878cb1996aa90ff3f714205e5ba8d4a891a57463ad54594e553e39fd2691/tok-0.1.14-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e8b91768944ea55e6426380623170c5a", "sha256": "6c48d5b77c8e4e9e6e8413827d6f0ba8c058ce5578198b2ee68901a07fbc2e57" }, "downloads": -1, "filename": "tok-0.1.14.tar.gz", "has_sig": false, "md5_digest": "e8b91768944ea55e6426380623170c5a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5556, "upload_time": "2019-07-09T18:48:26", "url": "https://files.pythonhosted.org/packages/fb/e8/6798c017485aa2e0713b9f4fdae679e91940b3927839b8d2866820c1b994/tok-0.1.14.tar.gz" } ] }