{
    "info": {
        "author": "Marat Zaynutdinov",
        "author_email": "tsundokum@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "License :: OSI Approved :: Apache Software License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3"
        ],
        "description": "# ru_sent_tokenize\nA simple and fast rule-based sentence segmentation. Tested on OpenCorpora and SynTagRus datasets.\n\n# Installation\n```\npip install rusenttokenize\n```\n\n# Running\n```ipython\n>>> from rusenttokenize import ru_sent_tokenize\n>>> ru_sent_tokenize('\u042d\u0442\u0430 \u0448\u043e\u043a\u043e\u043b\u0430\u0434\u043a\u0430 \u0437\u0430 400\u0440. \u043d\u0438\u0447\u0435\u0433\u043e \u0438\u0437 \u0441\u0435\u0431\u044f \u043d\u0435 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u044f\u043b\u0430. \u0410\u0440\u0442\u0451\u043c \u0440\u0435\u0448\u0438\u043b \u0431\u043e\u043b\u044c\u0448\u0435 \u043d\u0435 \u0445\u043e\u0434\u0438\u0442\u044c \u0432 \u044d\u0442\u043e\u0442 \u043c\u0430\u0433\u0430\u0437\u0438\u043d')\n['\u042d\u0442\u0430 \u0448\u043e\u043a\u043e\u043b\u0430\u0434\u043a\u0430 \u0437\u0430 400\u0440. \u043d\u0438\u0447\u0435\u0433\u043e \u0438\u0437 \u0441\u0435\u0431\u044f \u043d\u0435 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u044f\u043b\u0430.', '\u0410\u0440\u0442\u0451\u043c \u0440\u0435\u0448\u0438\u043b \u0431\u043e\u043b\u044c\u0448\u0435 \u043d\u0435 \u0445\u043e\u0434\u0438\u0442\u044c \u0432 \u044d\u0442\u043e\u0442 \u043c\u0430\u0433\u0430\u0437\u0438\u043d']\n```\n\n# Metrics\n\nThe tokenizer has been tested on OpenCorpora and SynTagRus. There are two important metrics. \n\nPrecision. First one is we took single sentences from the datasets and measured how many times tokenizer didn't split them.  \n\nRecall. Second metric is we took two consecutive sentences from the datasets and joined each pair with a space characted. We measured how many times tokenizer correctly splitted a long sentence into two.\n\n<table>\n  <tr>\n    <th rowspan=2>tokenizer</th>\n    <th colspan=3>OpenCorpora</th>\n    <th colspan=3>SynTagRus</th>\n  </tr>\n  <tr>\n    <th>Precision</th>\n    <th>Recall</th>\n    <th>Execution Time (sec)</th>\n    <th>Precision</th>\n    <th>Recall</th>\n    <th>Execution Time (sec)</th>\n  </tr>\n  <tbody>\n    <tr>\n      <td>nltk.sent_tokenize</td>\n      <td>94.30</td>\n      <td>86.06</td>\n      <td>8.67</td>\n      <td>98.15</td>\n      <td>94.95</td>\n      <td>5.07</td>\n    </tr>\n    <tr>\n      <td>nltk.sent_tokenize(x, language='russian')</td>\n      <td>95.53</td>\n      <td>88.37</td>\n      <td>8.54</td>\n      <td>98.44</td>\n      <td>95.45</td>\n      <td>5.68</td>\n    </tr>\n    <tr>\n      <td>bureaucratic-labs.segmentator.split</td>\n      <td>97.16</td>\n      <td>88.62</td>\n      <td>359</td>\n      <td>96.79</td>\n      <td>92.55</td>\n      <td>210</td>\n    </tr>\n    <tr>\n      <td>ru_sent_tokenize</td>\n      <td>98.73</td>\n      <td>93.45</td>\n      <td>4.92</td>\n      <td>99.81</td>\n      <td>98.59</td>\n      <td>2.87</td>\n    </tr>\n  </tbody>\n</table>\n\n[Notebook](https://github.com/deepmipt/ru_sentence_tokenizer/blob/master/metrics/calculate.ipynb) shows how the table above was calculated \n\n",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/deepmipt/ru_sentence_tokenizer",
        "keywords": "",
        "license": "",
        "maintainer": "",
        "maintainer_email": "",
        "name": "rusenttokenize",
        "package_url": "https://pypi.org/project/rusenttokenize/",
        "platform": "",
        "project_url": "https://pypi.org/project/rusenttokenize/",
        "project_urls": {
            "Homepage": "https://github.com/deepmipt/ru_sentence_tokenizer"
        },
        "release_url": "https://pypi.org/project/rusenttokenize/0.0.5/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "Rule-based sentence tokenizer for Russian language",
        "version": "0.0.5"
    },
    "last_serial": 4030291,
    "releases": {
        "0.0.4": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "69bccf218a188db463439c1db0f87ad9",
                    "sha256": "d0c4d3c373820b7b2c5ca21bf82c28d38e79793bed8c8766687e96fdc8b4a9d1"
                },
                "downloads": -1,
                "filename": "rusenttokenize-0.0.4-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "69bccf218a188db463439c1db0f87ad9",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 10003,
                "upload_time": "2018-07-04T12:30:07",
                "url": "https://files.pythonhosted.org/packages/f8/4c/e2aeee9cdcc266303289ad8c4acfdcf401781646bcc311ee2bf18f84d663/rusenttokenize-0.0.4-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "d7a250e3d8dd0f8cc1714262901d9952",
                    "sha256": "42982eb4d278f02e108bde088549c7b8059fecb766691eb1c0fbbb1d57cf85c7"
                },
                "downloads": -1,
                "filename": "rusenttokenize-0.0.4.tar.gz",
                "has_sig": false,
                "md5_digest": "d7a250e3d8dd0f8cc1714262901d9952",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5769,
                "upload_time": "2018-07-04T12:30:08",
                "url": "https://files.pythonhosted.org/packages/26/e7/ec1fa9372d09da72a7593ba951f70bd879df9856f64fb48261f36f78a2fe/rusenttokenize-0.0.4.tar.gz"
            }
        ],
        "0.0.5": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "0af470fc385d8a444f3dcae5dfb01561",
                    "sha256": "fcd604d6bc26334d46f87be1b0cd68022650c0a5dc613a39acf9d9da074d9f6b"
                },
                "downloads": -1,
                "filename": "rusenttokenize-0.0.5-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "0af470fc385d8a444f3dcae5dfb01561",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 10484,
                "upload_time": "2018-07-04T14:28:56",
                "url": "https://files.pythonhosted.org/packages/25/4c/a2f00be5def774a3df2e5387145f1cb54e324607ec4a7e23f573645946e7/rusenttokenize-0.0.5-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "9058f7d375e4c18278c3733e8dd10100",
                    "sha256": "b061b0ea40e880558dfe35a0040010c021007e1779517b25c8d47ba145c028c3"
                },
                "downloads": -1,
                "filename": "rusenttokenize-0.0.5.tar.gz",
                "has_sig": false,
                "md5_digest": "9058f7d375e4c18278c3733e8dd10100",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5946,
                "upload_time": "2018-07-04T14:28:57",
                "url": "https://files.pythonhosted.org/packages/6d/76/1226e1ddc11ad492a191664a4926c607bcbf1e5b352134ca6f83c4af8205/rusenttokenize-0.0.5.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "0af470fc385d8a444f3dcae5dfb01561",
                "sha256": "fcd604d6bc26334d46f87be1b0cd68022650c0a5dc613a39acf9d9da074d9f6b"
            },
            "downloads": -1,
            "filename": "rusenttokenize-0.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0af470fc385d8a444f3dcae5dfb01561",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 10484,
            "upload_time": "2018-07-04T14:28:56",
            "url": "https://files.pythonhosted.org/packages/25/4c/a2f00be5def774a3df2e5387145f1cb54e324607ec4a7e23f573645946e7/rusenttokenize-0.0.5-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "9058f7d375e4c18278c3733e8dd10100",
                "sha256": "b061b0ea40e880558dfe35a0040010c021007e1779517b25c8d47ba145c028c3"
            },
            "downloads": -1,
            "filename": "rusenttokenize-0.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "9058f7d375e4c18278c3733e8dd10100",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 5946,
            "upload_time": "2018-07-04T14:28:57",
            "url": "https://files.pythonhosted.org/packages/6d/76/1226e1ddc11ad492a191664a4926c607bcbf1e5b352134ca6f83c4af8205/rusenttokenize-0.0.5.tar.gz"
        }
    ]
}