{ "info": { "author": "Esukhia development team", "author_email": "esukhiadev@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Natural Language :: Tibetan", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Topic :: Text Processing :: Linguistic" ], "description": "# botok \u2013 Python Tibetan Tokenizer\n[![Build Status](https://travis-ci.org/Esukhia/pybo.svg?branch=master)](https://travis-ci.org/Esukhia/botok) [![Coverage Status](https://coveralls.io/repos/github/Esukhia/botok/badge.svg?branch=master)](https://coveralls.io/github/Esukhia/botok?branch=master) ![GitHub release](https://img.shields.io/github/release/Esukhia/botok.svg) [![CodeFactor](https://www.codefactor.io/repository/github/esukhia/botok/badge)](https://www.codefactor.io/repository/github/esukhia/botok) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://black.readthedocs.io/en/stable/)\n\n\n## Overview\n\nbotok tokenizes Tibetan text into words.\n\n### Basic usage\n\n#### Getting started\nRequires Python 3.6 or later.\n\n pip3 install botok\n\n```python\n>>> from botok import Text\n\n>>> # input is a multi-line string\n>>> in_str = \"\"\"\u0f63\u0f7a \u0f42\u0f66\u0f0d \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b\u0f58\u0f50\u0f60\u0f72\u0f0b \u0f06 \u0f64\u0f72\u0f0b\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b tr \n... \u0f56\u0f51\u0f7a\u0f0b\u0f0b\u0f63\u0f7a \u0f42\u0f66\u0f0d \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b\u0f56\u0f51\u0f7a\u0f0b\u0f63\u0f7a\u0f42\u0f66\u0f0b\u0f21\u0f22\u0f23\u0f40\u0f40\u0f0d \n... 
\u0f58\u0f50\u0f60\u0f72\u0f0b\u0f62\u0f92\u0fb1\u0f0b\u0f58\u0f5a\u0f7c\u0f62\u0f0b\u0f42\u0f53\u0f66\u0f0b\u0f54\u0f60\u0f72\u0f0b\u0f49\u0f66\u0f0b\u0f46\u0f74\u0f0b\u0f60\u0f50\u0f74\u0f44\u0f0b\u0f0d\u0f0d \u0f0d\u0f0d\u0f58\u0f41\u0f60\u0f0d\"\"\"\n\n\n### STEP1: instantiating Text\n\n>>> # A. on a string\n>>> t = Text(in_str)\n\n>>> # B. on a file\n... # note: all of the following operations can be applied to files in this way.\n>>> from pathlib import Path\n>>> in_file = Path.cwd() / 'test.txt'\n\n>>> # file content:\n>>> in_file.read_text()\n'\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b\u0f56\u0f51\u0f7a\u0f0b\u0f63\u0f7a\u0f42\u0f66\u0f0d\u0f0d\\n'\n\n>>> t = Text(in_file)\n>>> t.tokenize_chunks_plaintext\n\n>>> # checking that an output file has been written:\n... # a BOM is added by default so that Notepad on Windows doesn't scramble the line breaks\n>>> out_file = Path.cwd() / 'test_pybo.txt'\n>>> out_file.read_text()\n'\\ufeff\u0f56\u0f40\u0fb2\u0f0b \u0f64\u0f72\u0f66\u0f0b \u0f56\u0f51\u0f7a\u0f0b \u0f63\u0f7a\u0f42\u0f66 \u0f0d\u0f0d'\n\n### STEP2: properties will perform actions on the input string:\n### note: original spaces are replaced by underscores.\n\n>>> # OUTPUT1: chunks are meaningful groups of chars from the input string.\n... 
# see how punctuation, numerals, non-Tibetan text and syllables are all neatly grouped.\n>>> t.tokenize_chunks_plaintext\n'\u0f63\u0f7a_\u0f42\u0f66 \u0f0d_ \u0f56\u0f40\u0fb2\u0f0b \u0f64\u0f72\u0f66\u0f0b \u0f58\u0f50\u0f60\u0f72\u0f0b _\u0f06_ \u0f64\u0f72\u0f0b \u0f56\u0f40\u0fb2\u0f0b \u0f64\u0f72\u0f66\u0f0b__ tr_\\n \u0f56\u0f51\u0f7a\u0f0b\u0f0b \u0f63\u0f7a_\u0f42\u0f66 \u0f0d_ \u0f56\u0f40\u0fb2\u0f0b \u0f64\u0f72\u0f66\u0f0b \u0f56\u0f51\u0f7a\u0f0b \u0f63\u0f7a\u0f42\u0f66\u0f0b \u0f21\u0f22\u0f23 \u0f40\u0f40 \u0f0d_\\n \u0f58\u0f50\u0f60\u0f72\u0f0b \u0f62\u0f92\u0fb1\u0f0b \u0f58\u0f5a\u0f7c\u0f62\u0f0b \u0f42\u0f53\u0f66\u0f0b \u0f54\u0f60\u0f72\u0f0b \u0f49\u0f66\u0f0b \u0f46\u0f74\u0f0b \u0f60\u0f50\u0f74\u0f44\u0f0b \u0f0d\u0f0d_\u0f0d\u0f0d \u0f58\u0f41\u0f60 \u0f0d'\n\n>>> # OUTPUT2: could just as well be achieved with in_str.split(' ')\n>>> t.tokenize_on_spaces\n'\u0f63\u0f7a \u0f42\u0f66\u0f0d \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b\u0f58\u0f50\u0f60\u0f72\u0f0b \u0f06 \u0f64\u0f72\u0f0b\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b tr \u0f56\u0f51\u0f7a\u0f0b\u0f0b\u0f63\u0f7a \u0f42\u0f66\u0f0d \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b\u0f56\u0f51\u0f7a\u0f0b\u0f63\u0f7a\u0f42\u0f66\u0f0b\u0f21\u0f22\u0f23\u0f40\u0f40\u0f0d \u0f58\u0f50\u0f60\u0f72\u0f0b\u0f62\u0f92\u0fb1\u0f0b\u0f58\u0f5a\u0f7c\u0f62\u0f0b\u0f42\u0f53\u0f66\u0f0b\u0f54\u0f60\u0f72\u0f0b\u0f49\u0f66\u0f0b\u0f46\u0f74\u0f0b\u0f60\u0f50\u0f74\u0f44\u0f0b\u0f0d\u0f0d \u0f0d\u0f0d\u0f58\u0f41\u0f60\u0f0d'\n\n>>> # OUTPUT3: segments into words.\n... # see how \u0f56\u0f51\u0f7a\u0f0b\u0f0b\u0f63\u0f7a_\u0f42\u0f66 was still recognized as a single word, even with the space and the double tsek.\n... # the affixed particles are separated from the host word: \u0f58\u0f50 \u0f60\u0f72\u0f0b \u0f62\u0f92\u0fb1\u0f0b\u0f58\u0f5a\u0f7c \u0f62\u0f0b \u0f42\u0f53\u0f66\u0f0b\u0f54 \u0f60\u0f72\u0f0b \u0f49 \u0f66\u0f0b\n>>> t.tokenize_words_raw_text\nLoading Trie... 
(2s.)\n'\u0f63\u0f7a_\u0f42\u0f66 \u0f0d_ \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b \u0f58\u0f50 \u0f60\u0f72\u0f0b _\u0f06_ \u0f64\u0f72\u0f0b \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b_ tr_ \u0f56\u0f51\u0f7a\u0f0b\u0f0b\u0f63\u0f7a_\u0f42\u0f66 \u0f0d_ \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b \u0f56\u0f51\u0f7a\u0f0b\u0f63\u0f7a\u0f42\u0f66\u0f0b \u0f21\u0f22\u0f23 \u0f40\u0f40 \u0f0d_ \u0f58\u0f50 \u0f60\u0f72\u0f0b \u0f62\u0f92\u0fb1\u0f0b\u0f58\u0f5a\u0f7c \u0f62\u0f0b \u0f42\u0f53\u0f66\u0f0b\u0f54 \u0f60\u0f72\u0f0b \u0f49 \u0f66\u0f0b \u0f46\u0f74\u0f0b \u0f60\u0f50\u0f74\u0f44\u0f0b \u0f0d\u0f0d_\u0f0d\u0f0d \u0f58\u0f41\u0f60 \u0f0d'\n>>> t.tokenize_words_raw_lines\n'\u0f63\u0f7a_\u0f42\u0f66 \u0f0d_ \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b \u0f58\u0f50 \u0f60\u0f72\u0f0b _\u0f06_ \u0f64\u0f72\u0f0b \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b__ tr_\\n \u0f56\u0f51\u0f7a\u0f0b\u0f0b\u0f63\u0f7a_\u0f42\u0f66 \u0f0d_ \u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b \u0f56\u0f51\u0f7a\u0f0b\u0f63\u0f7a\u0f42\u0f66\u0f0b \u0f21\u0f22\u0f23 \u0f40\u0f40 \u0f0d_\\n \u0f58\u0f50 \u0f60\u0f72\u0f0b \u0f62\u0f92\u0fb1\u0f0b\u0f58\u0f5a\u0f7c \u0f62\u0f0b \u0f42\u0f53\u0f66\u0f0b\u0f54 \u0f60\u0f72\u0f0b \u0f49 \u0f66\u0f0b \u0f46\u0f74\u0f0b \u0f60\u0f50\u0f74\u0f44\u0f0b \u0f0d\u0f0d_\u0f0d\u0f0d \u0f58\u0f41\u0f60 \u0f0d'\n\n>>> # OUTPUT4: segments into words, then counts the occurrences of each word found\n... # by default, it counts in_str's substrings in the output, which is why we have \u0f56\u0f51\u0f7a\u0f0b\u0f0b\u0f63\u0f7a \u0f42\u0f66\t1, \u0f56\u0f51\u0f7a\u0f0b\u0f63\u0f7a\u0f42\u0f66\u0f0b\t1\n... 
# this behaviour can easily be modified to take into account the words that pybo recognized instead (see advanced usage)\n>>> print(t.list_word_types)\n\u0f60\u0f72\u0f0b\t3\n\u0f0d \t2\n\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b\t2\n\u0f58\u0f50\t2\n\u0f63\u0f7a \u0f42\u0f66\t1\n \u0f06 \t1\n\u0f64\u0f72\u0f0b\t1\n\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66\u0f0b \t1\ntr \\n\t1\n\u0f56\u0f51\u0f7a\u0f0b\u0f0b\u0f63\u0f7a \u0f42\u0f66\t1\n\u0f56\u0f51\u0f7a\u0f0b\u0f63\u0f7a\u0f42\u0f66\u0f0b\t1\n\u0f21\u0f22\u0f23\t1\n\u0f40\u0f40\t1\n\u0f0d \\n\t1\n\u0f62\u0f92\u0fb1\u0f0b\u0f58\u0f5a\u0f7c\t1\n\u0f62\u0f0b\t1\n\u0f42\u0f53\u0f66\u0f0b\u0f54\t1\n\u0f49\t1\n\u0f66\u0f0b\t1\n\u0f46\u0f74\u0f0b\t1\n\u0f60\u0f50\u0f74\u0f44\u0f0b\t1\n\u0f0d\u0f0d \u0f0d\u0f0d\t1\n\u0f58\u0f41\u0f60\t1\n\u0f0d\t1\n```\n\n## Acknowledgements\n\n**botok** is an open source library for Tibetan NLP.\n\nWe are always open to cooperation in introducing new features, tool integrations and testing solutions.\n\nMany thanks to the companies and organizations who have supported botok's development, especially:\n\n* [Khyentse Foundation](https://khyentsefoundation.org) for contributing USD 22,000 to kickstart the project \n* The [Barom/Esukhia canon project](http://www.barom.org) for sponsoring training data curation\n* [BDRC](https://tbrc.org) for contributing 2 staff for 6 months for data curation\n\n## Maintenance\n\nBuild the source distribution:\n\n```\nrm -rf dist/\npython3 setup.py clean sdist\n```\n\nand upload it with twine (version >= `1.11.0`):\n\n```\ntwine upload dist/*\n```\n\n## License\n\nThe Python code is Copyright (C) 2019 Esukhia, provided under [Apache 2](LICENSE). 
\n\ncontributors:\n * [Drupchen](https://github.com/drupchen)\n * [\u00c9lie Roux](https://github.com/eroux)\n * [Ngawang Trinley](https://github.com/ngawangtrinley)\n * [Mikko Kotila](https://github.com/mikkokotila)\n * [Thubten Rinzin](https://github.com/thubtenrigzin)\n\n * [Tenzin](https://github.com/10zinten)\n * Joyce Mackzenzie for reworking the logo", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Esukhia/botok", "keywords": "nlp computational_linguistics tibetan tokenizer token", "license": "Apache2", "maintainer": "", "maintainer_email": "", "name": "botok", "package_url": "https://pypi.org/project/botok/", "platform": "", "project_url": "https://pypi.org/project/botok/", "project_urls": { "Homepage": "https://github.com/Esukhia/botok", "Source": "https://github.com/Esukhia/botok", "Tracker": "https://github.com/Esukhia/botok/issues" }, "release_url": "https://pypi.org/project/botok/0.6.12/", "requires_dist": null, "requires_python": ">=3.6", "summary": "Tibetan Word Tokenizer", "version": "0.6.12" }, "last_serial": 5938738, "releases": { "0.6.10": [ { "comment_text": "", "digests": { "md5": "d5c7b7ec502fb0519c9cdf53989c1b43", "sha256": "4d2dcdf09c2f17392dd4224ad20bed2c5ea23f335267304971effe2ce3b9425d" }, "downloads": -1, "filename": "botok-0.6.10.tar.gz", "has_sig": false, "md5_digest": "d5c7b7ec502fb0519c9cdf53989c1b43", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1094226, "upload_time": "2019-09-12T20:06:11", "url": "https://files.pythonhosted.org/packages/e7/36/8063c19f5ece7f76a2192348f086c93fc1bea23b3541b62db9e95eed6e9b/botok-0.6.10.tar.gz" } ], "0.6.11": [ { "comment_text": "", "digests": { "md5": "babd20af63a0a3c8ea3e7632dac223ce", "sha256": "08582549ffa9775180ef3f32b839a9d7679c7730e30f7f80a3eea8ab5a7dac23" }, "downloads": -1, "filename": "botok-0.6.11.tar.gz", 
"has_sig": false, "md5_digest": "babd20af63a0a3c8ea3e7632dac223ce", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1094995, "upload_time": "2019-10-04T21:48:16", "url": "https://files.pythonhosted.org/packages/87/eb/7d4ada02c5b5644c9433e819ff90a6fb0bd8fb3bcb73c142428c2fdffb92/botok-0.6.11.tar.gz" } ], "0.6.12": [ { "comment_text": "", "digests": { "md5": "314bc84aa814ab5d70694aba3c4d619f", "sha256": "80fb3513f2a3e9a09dc4c926e28fddeb6e3256488a6ee40222d62e978298f613" }, "downloads": -1, "filename": "botok-0.6.12.tar.gz", "has_sig": false, "md5_digest": "314bc84aa814ab5d70694aba3c4d619f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1094935, "upload_time": "2019-10-07T13:35:43", "url": "https://files.pythonhosted.org/packages/fd/74/7e557557893e1b2ae0cfd7407477af9509f1c97646f6aac43d7acbcda5f3/botok-0.6.12.tar.gz" } ], "0.6.9": [ { "comment_text": "", "digests": { "md5": "e49970c9d2ec431cf33cf1af4e0badfa", "sha256": "d8b2c6ad17163a2258c3cca687b2b032a3a588b4fe61881a9fdd64cec2051fcb" }, "downloads": -1, "filename": "botok-0.6.9.tar.gz", "has_sig": false, "md5_digest": "e49970c9d2ec431cf33cf1af4e0badfa", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1094373, "upload_time": "2019-09-01T22:19:22", "url": "https://files.pythonhosted.org/packages/0b/b2/c6539b4b163b4f1360961d04bba2341a37c09d5b09113bcdae9d5a953fab/botok-0.6.9.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "314bc84aa814ab5d70694aba3c4d619f", "sha256": "80fb3513f2a3e9a09dc4c926e28fddeb6e3256488a6ee40222d62e978298f613" }, "downloads": -1, "filename": "botok-0.6.12.tar.gz", "has_sig": false, "md5_digest": "314bc84aa814ab5d70694aba3c4d619f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 1094935, "upload_time": "2019-10-07T13:35:43", "url": 
"https://files.pythonhosted.org/packages/fd/74/7e557557893e1b2ae0cfd7407477af9509f1c97646f6aac43d7acbcda5f3/botok-0.6.12.tar.gz" } ] }