{ "info": { "author": "Marat Zaynutdinov", "author_email": "tsundokum@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# ru_sent_tokenize\nA simple and fast rule-based sentence segmenter. Tested on the OpenCorpora and SynTagRus datasets.\n\n# Installation\n```\npip install rusenttokenize\n```\n\n# Running\n```ipython\n>>> from rusenttokenize import ru_sent_tokenize\n>>> ru_sent_tokenize('\u042d\u0442\u0430 \u0448\u043e\u043a\u043e\u043b\u0430\u0434\u043a\u0430 \u0437\u0430 400\u0440. \u043d\u0438\u0447\u0435\u0433\u043e \u0438\u0437 \u0441\u0435\u0431\u044f \u043d\u0435 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u044f\u043b\u0430. \u0410\u0440\u0442\u0451\u043c \u0440\u0435\u0448\u0438\u043b \u0431\u043e\u043b\u044c\u0448\u0435 \u043d\u0435 \u0445\u043e\u0434\u0438\u0442\u044c \u0432 \u044d\u0442\u043e\u0442 \u043c\u0430\u0433\u0430\u0437\u0438\u043d')\n['\u042d\u0442\u0430 \u0448\u043e\u043a\u043e\u043b\u0430\u0434\u043a\u0430 \u0437\u0430 400\u0440. \u043d\u0438\u0447\u0435\u0433\u043e \u0438\u0437 \u0441\u0435\u0431\u044f \u043d\u0435 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043b\u044f\u043b\u0430.', '\u0410\u0440\u0442\u0451\u043c \u0440\u0435\u0448\u0438\u043b \u0431\u043e\u043b\u044c\u0448\u0435 \u043d\u0435 \u0445\u043e\u0434\u0438\u0442\u044c \u0432 \u044d\u0442\u043e\u0442 \u043c\u0430\u0433\u0430\u0437\u0438\u043d']\n```\n\n# Metrics\n\nThe tokenizer has been tested on OpenCorpora and SynTagRus using two metrics.\n\nPrecision. We took single sentences from the datasets and measured how often the tokenizer left them unsplit.\n\nRecall. We took pairs of consecutive sentences from the datasets and joined each pair with a space character. We measured how often the tokenizer correctly split the joined text into two sentences.\n\n
| Tokenizer | OpenCorpora Precision | OpenCorpora Recall | OpenCorpora Execution Time (sec) | SynTagRus Precision | SynTagRus Recall | SynTagRus Execution Time (sec) |
|---|---|---|---|---|---|---|
| nltk.sent_tokenize | 94.30 | 86.06 | 8.67 | 98.15 | 94.95 | 5.07 |
| nltk.sent_tokenize(x, language='russian') | 95.53 | 88.37 | 8.54 | 98.44 | 95.45 | 5.68 |
| bureaucratic-labs.segmentator.split | 97.16 | 88.62 | 359 | 96.79 | 92.55 | 210 |
| ru_sent_tokenize | 98.73 | 93.45 | 4.92 | 99.81 | 98.59 | 2.87 |
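
The two metrics described above could be computed roughly as follows. This is only a minimal sketch, not the evaluation script used to produce the table: the names `evaluate` and `naive_split` are hypothetical, `naive_split` is a simple regex stand-in so the example runs without any dependencies, and counting a pair as correct only when the output equals exactly the two original sentences is an assumption about what "correctly split" means.

```python
import re
from typing import Callable, List, Tuple


def naive_split(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: split after '.', '!' or '?' followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def evaluate(tokenize: Callable[[str], List[str]], sentences: List[str]) -> Tuple[float, float]:
    """Return (precision, recall) in percent for a sentence tokenizer."""
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to form a pair")

    # Precision: share of single sentences that the tokenizer leaves unsplit.
    kept = sum(1 for s in sentences if len(tokenize(s)) == 1)
    precision = 100.0 * kept / len(sentences)

    # Recall: share of consecutive pairs, joined with a space, that come back
    # as exactly the two original sentences (an assumption of this sketch).
    pairs = list(zip(sentences, sentences[1:]))
    split_ok = sum(1 for a, b in pairs if tokenize(a + " " + b) == [a, b])
    recall = 100.0 * split_ok / len(pairs)
    return precision, recall


# Toy data: the abbreviation "rub." trips up the naive splitter.
sentences = ["It costs 400 rub. per bar.", "Tom left.", "Was it late?"]
precision, recall = evaluate(naive_split, sentences)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

To score the tokenizer from this package instead, pass `ru_sent_tokenize` (imported from `rusenttokenize` as shown in the Running section) in place of `naive_split`.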