{ "info": { "author": "Puck Treeratpituk", "author_email": "pucktada@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# Cutkum ['\u0e04\u0e31\u0e14\u0e04\u0e33']\nCutkum ('\u0e04\u0e31\u0e14\u0e04\u0e33') is a python code for Thai Word-Segmentation using Recurrent Neural Network (RNN) based on Tensorflow library. \n\nCutkum is trained on BEST2010, a 5 Millions Thai words corpus by NECTEC (https://www.nectec.or.th/). It also comes with an already trained model, and can be used right out of the box. Cutkum is still a work-in-progress project. Evaluated on the 10% hold-out data from BEST2010 corpus (~600,000 words), the included trained model currently performs at \n\n98.0% recall, 96.3% precision, 97.1% F-measure (character-level)\n93.5% recall, 94.1% precision and 94.0% F-measure (word-level -- same evaluation method as BEST2010)\n\n# Update\nFeb 17, 2018 - add the training script\n\n# Requirements\n* python = 2.7, 3.0+\n* tensorflow = 1.4+\n\n# Installation\n\n`cutkum` can be installed using `pip` and the trained model can be downloaded from github. The current included model (model/lstm.l6.d2.pb) is a stacked bi-directional LSTM neural network with 6 layers. \n\n```\npip install cutkum\n\n# then download the trained model (either from github) or with wget\n\nwget https://raw.githubusercontent.com/pucktada/cutkum/master/model/lstm.l6.d2.pb\n```\n\n# Usages\n\nOnce installed, you can use `cutkum` within your python code to tokenize thai sentences. \n\n```\n\n>>> from cutkum.tokenizer import Cutkum\n\n>>> ck = Cutkum('lstm.l6.d2.pb')\n>>> words = ck.tokenize(\"\u0e2a\u0e32\u0e23\u0e32\u0e19\u0e38\u0e01\u0e23\u0e21\u0e44\u0e17\u0e22\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a\u0e40\u0e22\u0e32\u0e27\u0e0a\u0e19\u0e2f\")\n\n# python 3.0\n>>> words\n['\u0e2a\u0e32\u0e23\u0e32\u0e19\u0e38\u0e01\u0e23\u0e21', '\u0e44\u0e17\u0e22', '\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a', '\u0e40\u0e22\u0e32\u0e27\u0e0a\u0e19', '\u0e2f']\n\n# python 2.7\n>>> print(\"|\".join(words)) \n# \u0e2a\u0e32\u0e23\u0e32\u0e19\u0e38\u0e01\u0e23\u0e21|\u0e44\u0e17\u0e22|\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a|\u0e40\u0e22\u0e32\u0e27\u0e0a\u0e19|\u0e2f\n\n```\n\nYou can also use `cutkum` straight from the command line.\n\n```\nusage: cutkum [-h] [-v] -m MODEL_FILE\n (-s SENTENCE | -i INPUT_FILE | -id INPUT_DIR)\n [-o OUTPUT_FILE | -od OUTPUT_DIR] [--max | --viterbi]\n```\n\n```\ncutkum -m model/lstm.l6.d2.pb -s \"\u0e2a\u0e32\u0e23\u0e32\u0e19\u0e38\u0e01\u0e23\u0e21\u0e44\u0e17\u0e22\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a\u0e40\u0e22\u0e32\u0e27\u0e0a\u0e19\u0e2f\"\n\n# output as\n\u0e2a\u0e32\u0e23\u0e32\u0e19\u0e38\u0e01\u0e23\u0e21|\u0e44\u0e17\u0e22|\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a|\u0e40\u0e22\u0e32\u0e27\u0e0a\u0e19|\u0e2f\n```\n\n\n`cutkum` can also be used to segment text within a file (with -i), or to segment all the files within a given directory (with -id).\n\n```\ncutkum -m model/lstm.l6.d2.pb -i input.txt -o output.txt\ncutkum -m model/lstm.l6.d2.pb -id input_dir -od output_dir\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details\n\n## To Do\n* Improve performance, with better better model, and better included trained-model\n* Improve the speed when processing big file\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/pucktada/cutkum", "keywords": "thai tokenizer tensorflow lstm", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "cutkum", "package_url": "https://pypi.org/project/cutkum/", "platform": "", "project_url": "https://pypi.org/project/cutkum/", "project_urls": { "Homepage": "https://github.com/pucktada/cutkum" }, "release_url": "https://pypi.org/project/cutkum/2.4/", "requires_dist": null, "requires_python": "", "summary": "Thai Word-Segmentation with LSTM in Tensorflow", "version": "2.4" }, "last_serial": 3605806, "releases": { "1.4": [ { "comment_text": "", "digests": { "md5": "fa8719a94cdac97455919fbd53346fe6", "sha256": "38ec3cde074481a959890078c1e472e081657f988a8d1db8729c25c692dcbc37" }, "downloads": -1, "filename": "cutkum-1.4.tar.gz", "has_sig": false, "md5_digest": "fa8719a94cdac97455919fbd53346fe6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17426, "upload_time": "2018-01-26T01:46:16", "url": "https://files.pythonhosted.org/packages/f5/20/91c004bc2123e763a3e4b3eeee38626771613f74b01ca20a25e057dd7f5c/cutkum-1.4.tar.gz" } ], "1.4.2": [ { "comment_text": "", "digests": { "md5": "8d8b833d75cc3d79a5d5c22b1ef3c817", "sha256": "696b1c5aab15fcbcd69158a7dd2d758fc7e7a165bf3a3d108cc7cd4fa5b26a8e" }, "downloads": -1, "filename": "cutkum-1.4.2.tar.gz", "has_sig": false, "md5_digest": "8d8b833d75cc3d79a5d5c22b1ef3c817", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17595, "upload_time": "2018-02-17T09:01:27", "url": "https://files.pythonhosted.org/packages/37/24/3671a8da6ba7785a7e6b03ffb81a9e442906c963d076872d7c79495d10a3/cutkum-1.4.2.tar.gz" } ], "2.4": [ { "comment_text": "", "digests": { "md5": "440d1a3ca636f5bdb84a54f52588ef90", "sha256": "0a274c0d7ca31269756a563526d8211deb0f8614fb49bba1fdf943bf688b1ba3" }, "downloads": -1, "filename": "cutkum-2.4.tar.gz", "has_sig": false, "md5_digest": "440d1a3ca636f5bdb84a54f52588ef90", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6164590, "upload_time": "2018-02-22T15:36:50", "url": "https://files.pythonhosted.org/packages/f6/94/cfcd80851a04c53f24d1da2dc320ab72879ccabad03c5c5b602942245555/cutkum-2.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "440d1a3ca636f5bdb84a54f52588ef90", "sha256": "0a274c0d7ca31269756a563526d8211deb0f8614fb49bba1fdf943bf688b1ba3" }, "downloads": -1, "filename": "cutkum-2.4.tar.gz", "has_sig": false, "md5_digest": "440d1a3ca636f5bdb84a54f52588ef90", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6164590, "upload_time": "2018-02-22T15:36:50", "url": "https://files.pythonhosted.org/packages/f6/94/cfcd80851a04c53f24d1da2dc320ab72879ccabad03c5c5b602942245555/cutkum-2.4.tar.gz" } ] }