{ "info": { "author": "Florian Leitner", "author_email": "me@fnl.es", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries", "Topic :: Text Processing", "Topic :: Text Processing :: Linguistic" ], "description": "======\nsyntok\n======\n\n(a.k.a. segtok_ v2)\n\n.. image:: https://img.shields.io/pypi/v/syntok.svg\n :target: https://pypi.python.org/pypi/syntok\n\n.. image:: https://travis-ci.org/fnl/syntok.svg?branch=master\n :target: https://travis-ci.org/fnl/syntok\n\n-------------------------------------------\nSentence segmentation and word tokenization\n-------------------------------------------\n\nThe syntok package provides two modules, ``syntok.segmenter`` and ``syntok.tokenizer``.\nThe tokenizer provides functionality for splitting (Indo-European) text into words and symbols (collectively called *tokens*).\nThe segmenter provides functionality for splitting (Indo-European) token streams (from the tokenizer) into sentences and for pre-processing documents by splitting them into paragraphs.\nBoth modules can also be used from the command-line to split either a given text file (argument) or by reading from STDIN.\nWhile other Indo-European languages could work, it has only been designed with the languages Spanish, English, and German in mind (the author's main languages).\n\n``segtok``\n==========\n\nSyntok is the successor of an earlier, very similar tool, segtok_, but has evolved significantly in terms of providing better segmentation and tokenization performance and throughput (syntok can segment documents at a rate of about 100k tokens per second without problems).\nFor example, if a sentence terminal marker is not followed by a spacing character, segtok is unable to 
detect that as a terminal marker, while syntok has no problem segmenting that case (as it uses tokenization first, and does segmentation afterwards).\nIn fact, I feel confident enough to just boldly claim syntok is the world's best sentence segmenter for at least English, Spanish, and German.\n\nInstall\n=======\n\nTo use this package, you minimally should have Python 3.5 or later installed.\nAs it uses the typing package, earlier versions are not supported.\nThe easiest way to get ``syntok`` installed is using ``pip`` or any other package manager that works with PyPI::\n\n pip3 install syntok\n\n*Important*: If you are on a Linux machine and have problems installing the ``regex`` dependency of ``syntok``, make sure you have the ``python-dev`` and/or ``python3-dev`` packages installed to get the necessary headers to compile that package.\n\nThen try the command line tools on some plain-text files (e.g., this README) to see if ``syntok`` meets your needs::\n\n python3 -m syntok.segmenter README.rst\n python3 -m syntok.tokenizer README.rst\n\nTest Suite\n==========\n\nTo run the test suite, you have to have flake8, pytest, and mypy installed (``pip3 install flake8 pytest mypy``).\n\nThe testing environment works by running ``make`` targets (i.e., you need GNU Make or something equivalent around) or you have to call the three commands by hand::\n\n make check\n\n # OR\n flake8 syntok # make lint\n mypy syntok # make type\n pytest syntok # make test\n\nUsage\n=====\n\nFor details, please refer to the code documentation; this README only provides an overview of the provided functionality.\n\nCommand-line\n------------\n\nAfter installing the package, two command-line tools are available, ``python -m syntok.segmenter`` and ``python -m syntok.tokenizer``.\nEach takes [UTF-8 encoded] plain-text files (or STDIN) as input and transforms them into newline-separated sentences or space-separated tokens, respectively.\nYou can control Python3's file ``open`` encoding by `configuring 
the environment variable`_ ``PYTHONIOENCODING`` to your needs (e.g. ``export PYTHONIOENCODING=\"utf-16-be\"``).\nThe tokenizer produces single-space separated tokens for each input line.\nThe segmenter produces line-segmented sentences for each input file (or after STDIN closes).\n\n``syntok.tokenizer``\n--------------------\n\nThis module provides the ``Tokenizer`` class to tokenize input text into words and symbols (**value** Tokens), prefixed with (possibly empty) **spacing** strings, while recording their **offset** positions.\nThe Tokenizer comes with static utility functions to join hyphenated words across line-breaks and to reproduce the original string from a sequence of tokens.\nThe Tokenizer considers camelCase words as individual tokens (here: camel and Case) and by default considers underscores and Unicode hyphens *inside* words as spacing characters (not Token values).\nIt does not split numeric tokens (without letters) if they contain symbols (e.g. maintaining \"2018-11-11\", \"12:30:21\", \"1_000_000\", \"1,000.00\", or \"1..3\" all as single tokens).\nFinally, as it splits English negation contractions (such as \"don't\") into their root and \"not\" (here: do and not), it can be configured to refrain from replacing this special \"n't\" token with \"not\", and instead emit the actual \"n't\" value.\n\nTo track the spacing and offset of tokens, the module contains the ``Token`` class, which is a ``str`` wrapper class: the token **value** itself is available from the ``value`` property, and the added ``spacing`` and ``offset`` properties hold the **spacing** prefix and the **offset** position of the token, respectively.\n\nBasic example::\n\n from syntok.tokenizer import Tokenizer\n\n document = open('README.rst').read()\n tok = Tokenizer() # optional: keep \"n't\" contractions and \"-\", \"_\" inside words as tokens\n\n for token in tok.tokenize(document):\n     print(repr(token))\n\n``syntok.segmenter``\n--------------------\n\nThis module 
provides several functions to segment documents into iterators over paragraphs, sentences, and tokens (functions ``analyze`` and ``process``) or simply sentences and tokens (functions ``split`` and ``segment``).\nThe analytic segmenter can even keep track of the original offset of each token in the document while processing (but does not join hyphen-separated words across line-breaks).\nAll segmenter functions accept arbitrary Token streams as input (typically as generated by the ``Tokenizer.tokenize`` method).\nDue to how ``syntok.tokenizer.Token`` objects \"work\", it is possible to establish the exact sentence content (with the original spacing between the tokens).\nThe pre-processing functions and paragraph-based segmentation split paragraphs, i.e., chunks of text separated by at least two consecutive linebreaks (``\\\\r?\\\\n``).\n\nBasic example::\n\n import syntok.segmenter as segmenter\n\n document = open('README.rst').read()\n\n # choose the segmentation function you need/prefer\n\n for paragraph in segmenter.process(document):\n     for sentence in paragraph:\n         for token in sentence:\n             # roughly reproduce the input,\n             # except for hyphenated word-breaks\n             # and replacing \"n't\" contractions with \"not\",\n             # separating tokens by single spaces\n             print(token.value, end=' ')\n         print() # print one sentence per line\n     print() # separate paragraphs with newlines\n\n for paragraph in segmenter.analyze(document):\n     for sentence in paragraph:\n         for token in sentence:\n             # exactly reproduce the input\n             # and do not remove \"imperfections\"\n             print(token.spacing, token.value, sep='', end='')\n     print(\"\\n\") # reinsert paragraph separators\n\nLegal\n=====\n\nLicense: `MIT `_\n\nCopyright (c) 2017-2019, Florian Leitner. 
All rights reserved.\n\n\nHistory\n=======\n\n- **1.2.1** added a generic rule for catching more uncommon uses of \".\" without space suffix as abbreviation marker\n- **1.2.0** added support for skipping and handling text in brackets (e.g., citations)\n- **1.1.1** fixed non-trivial segmentation in sci. text and refactored splitting logic to one place only\n- **1.1.0** added support for ellipses (back - from segtok) in\n- **1.0.2** hyphen joining only should happen when letters are present; squash escape warnings\n- **1.0.1** fixing segmenter.analyze to preserve \"n't\" contractions, and improved the README and Tokenizer constructor API\n- **1.0.0** initial release\n\n.. _segtok: https://github.com/fnl/segtok\n.. _configuring the environment variable: https://docs.python.org/3/using/cmdline.html\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/fnl/syntok", "keywords": "sentence segmenter splitter split word tokenizer token nlp", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "syntok", "package_url": "https://pypi.org/project/syntok/", "platform": "", "project_url": "https://pypi.org/project/syntok/", "project_urls": { "Homepage": "https://github.com/fnl/syntok" }, "release_url": "https://pypi.org/project/syntok/1.2.1/", "requires_dist": null, "requires_python": "", "summary": "sentence segmentation and word tokenization toolkit", "version": "1.2.1" }, "last_serial": 5286649, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "d6c56d62dacbe77af7cae918e132204a", "sha256": "be5a72bb94f785dc193fabc13af129635c1862ddb368823fecd5518351588eeb" }, "downloads": -1, "filename": "syntok-1.0.0-py3.6.egg", "has_sig": false, "md5_digest": "d6c56d62dacbe77af7cae918e132204a", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": null, "size": 31247, "upload_time": "2018-11-14T23:05:45", "url": 
"https://files.pythonhosted.org/packages/cb/8d/1344b321bd61f66345f331ef32bfbf90adf2339cfb848aecbf64a98d28e1/syntok-1.0.0-py3.6.egg" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "90b70bf2a62548d8cc9480752fc39bca", "sha256": "99533556e55ee8c0286cfa4bf8184fd2c4a6714ea9721a737fe0d9bc8ea5ebe5" }, "downloads": -1, "filename": "syntok-1.0.1.tar.gz", "has_sig": false, "md5_digest": "90b70bf2a62548d8cc9480752fc39bca", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16307, "upload_time": "2018-11-23T12:20:11", "url": "https://files.pythonhosted.org/packages/2d/dc/6f025ee1d8f6988c69f1ae5cead0bc1309a9f8f17d05d3e5ff95c9440668/syntok-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "48ba5b6a5149a88146986d5bc56271d6", "sha256": "8de23974f2520730c2c9416797a78a31552ed098e72d8647db39a199c9a22ad2" }, "downloads": -1, "filename": "syntok-1.0.2.tar.gz", "has_sig": false, "md5_digest": "48ba5b6a5149a88146986d5bc56271d6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16443, "upload_time": "2018-11-23T13:06:32", "url": "https://files.pythonhosted.org/packages/fe/72/373b8f1fb29a0f39a086745bf94a57707a878564bc10255ab524a45eb1ec/syntok-1.0.2.tar.gz" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "6432fab039cb2a5dabad9ff95327a1ff", "sha256": "b986b8c48c2430ce7514e3a7c6df56894758673a10896ef0f6a94236776febb7" }, "downloads": -1, "filename": "syntok-1.1.0.tar.gz", "has_sig": false, "md5_digest": "6432fab039cb2a5dabad9ff95327a1ff", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16656, "upload_time": "2019-02-09T22:07:39", "url": "https://files.pythonhosted.org/packages/9f/a1/b2e18c77d3bd13b7062d3566f46e38182bafe559635dc456d195e20b1ca8/syntok-1.1.0.tar.gz" } ], "1.1.1": [ { "comment_text": "", "digests": { "md5": "c3ce99e9ad6d2c7745e9456fe3fd35ca", "sha256": "c255238f7bdd7c650a9e5032a6f4f290124e4234f9750e764bf4b899f86613b5" }, "downloads": 
-1, "filename": "syntok-1.1.1.tar.gz", "has_sig": false, "md5_digest": "c3ce99e9ad6d2c7745e9456fe3fd35ca", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20138, "upload_time": "2019-03-28T21:54:26", "url": "https://files.pythonhosted.org/packages/89/c5/863fc93fdef607ecc8b5c4d90386feae3a6359b51cfe3d1f3e5970478ff4/syntok-1.1.1.tar.gz" } ], "1.2.0": [ { "comment_text": "", "digests": { "md5": "b30fc7efefca358c6ae0be7ceeaa2e14", "sha256": "c97b4501ae71c88b2beba245ca0d146641ebbecc09fc27d4d02ff03fabbc65db" }, "downloads": -1, "filename": "syntok-1.2.0.tar.gz", "has_sig": false, "md5_digest": "b30fc7efefca358c6ae0be7ceeaa2e14", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22164, "upload_time": "2019-05-17T17:23:24", "url": "https://files.pythonhosted.org/packages/05/d2/b948a211c5190f57832d8bfe35842a5fc95d90f7c5101b185f8e7f6a697f/syntok-1.2.0.tar.gz" } ], "1.2.1": [ { "comment_text": "", "digests": { "md5": "5a985aef41326d93884c7402af0ec960", "sha256": "473e18fd93beaca51cae888fcfd0e85156a93c66c1ed836e1cbeee4341e320fa" }, "downloads": -1, "filename": "syntok-1.2.1.tar.gz", "has_sig": false, "md5_digest": "5a985aef41326d93884c7402af0ec960", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22561, "upload_time": "2019-05-18T21:39:54", "url": "https://files.pythonhosted.org/packages/fd/dc/cab829908c5a46e447f21c2f8faea839de4f1fbe4742c6aff83c300b888b/syntok-1.2.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5a985aef41326d93884c7402af0ec960", "sha256": "473e18fd93beaca51cae888fcfd0e85156a93c66c1ed836e1cbeee4341e320fa" }, "downloads": -1, "filename": "syntok-1.2.1.tar.gz", "has_sig": false, "md5_digest": "5a985aef41326d93884c7402af0ec960", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22561, "upload_time": "2019-05-18T21:39:54", "url": 
"https://files.pythonhosted.org/packages/fd/dc/cab829908c5a46e447f21c2f8faea839de4f1fbe4742c6aff83c300b888b/syntok-1.2.1.tar.gz" } ] }