{ "info": { "author": "Jexus Chuang", "author_email": "", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Programming Language :: Python", "Programming Language :: Python :: 3.6" ], "description": "# Pywordseg\nAn open source state-of-the-art Chinese word segmentation system with BiLSTM and ELMo.\n\n- arXiv paper link: https://arxiv.org/abs/1901.05816\n- PyPI page: https://pypi.org/project/pywordseg/\n\n## Performance\n![](https://i.imgur.com/4WflkYS.png)\n- This repo provides the \"character level ELMo\" model and the \"baseline\" model shown in the figure. Our \"character level ELMo\" model outperforms the previous state-of-the-art Chinese word segmentation system (Ma et al. 2018). Both models also largely outperform \"[Jieba](https://github.com/fxsjy/jieba)\" (HMM-based) and \"[CKIP](http://ckipsvr.iis.sinica.edu.tw/)\" (rule-based), which are the most popular toolkits for processing simplified/traditional Chinese text.\n\n![](https://i.imgur.com/0vCz0ui.png)\n- On words not seen during training (OOV), our \"character level ELMo\" model still keeps good accuracy and outperforms our \"baseline\" model by about 5%.\n\n## Usage\n### Requirements\n- python >= 3.6 (do not use 3.5)\n- pytorch 0.4\n- overrides\n\n### Install with Pip\n - `$ pip install pywordseg`\n - the module will automatically download the models on your first import (within 1 minute).\n - if you use macOS and encounter the [urllib.error.URLError](https://stackoverflow.com/questions/49183801/ssl-certificate-verify-failed-with-urllib) problem when downloading the models, \n try `$ sudo /Applications/Python\\ 3.6/Install\\ Certificates.command` to resolve the certificate issue.\n\n### Install manually\n - `$ git clone https://github.com/voidism/pywordseg`\n - download [ELMoForManyLangs.zip](https://www.dropbox.com/s/eiya6ztmjopprsm/ELMoForManyLangs.zip?dl=0) and unzip it into `pywordseg/pywordseg` (the ELMo model code is from [HIT-SCIR](https://github.com/HIT-SCIR/ELMoForManyLangs); I trained the model myself at the character level)\n - run `$ pip install .` in the main directory\n\n### Segment!\n ```python\n # import the module\n from pywordseg import *\n\n # declare the segmentor.\n seg = Wordseg(batch_size=64, device=\"cuda:0\", embedding='elmo', elmo_use_cuda=True, mode=\"TW\")\n\n # input is a list of raw sentences.\n seg.cut([\"\u4eca\u5929\u5929\u6c23\u771f\u597d\u554a!\", \"\u6f6e\u6c34\u9000\u4e86\u5c31\u77e5\u9053\uff0c\u8ab0\u6c92\u7a7f\u8932\u5b50\u3002\"])\n\n # will return a list of lists of the segmented sentences.\n # [['\u4eca\u5929', '\u5929\u6c23', '\u771f', '\u597d', '\u554a', '!'], ['\u6f6e\u6c34', '\u9000', '\u4e86', '\u5c31', '\u77e5\u9053', ',', '\u8ab0', '\u6c92', '\u7a7f', '\u8932\u5b50', '\u3002']]\n ```\n#### Parameters:\n - **batch_size**: batch size for the word segmentation model, default: `64`.\n - **device**: the CPU/GPU device to run your model on, default: `'cpu'`.\n - **embedding**: (default: `'w2v'`) \n    - `'elmo'`: the loaded model will be the \"character level ELMo\" model above, which runs slowly.\n    - `'w2v'`: the loaded model will be the \"baseline\" model above, which runs faster than `'elmo'`.\n - **elmo_use_cuda**: set to `True` to accelerate the ELMo model on GPU; otherwise the ELMo model runs on CPU. This parameter has no effect when `embedding='w2v'`. default: `True`.\n - **mode**: `Wordseg` will load a different model according to the mode, as listed below: (default: `TW`)\n    - `TW`: trained on AS corpus, from CKIP, Academia Sinica, Taiwan.\n    - `HK`: trained on CityU corpus, from City University of Hong Kong, Hong Kong SAR.\n    - `CN_MSR`: trained on MSR corpus, from Microsoft Research, China.\n    - `CN_PKU` or `CN`: trained on PKU corpus, from Peking University, China.\n\n### TODO\n- Currently only Traditional Chinese is supported (even in CN mode, the text must be converted to Traditional Chinese to work, because all the training data has been converted with [OpenCC](https://github.com/BYVoid/OpenCC)). Simplified Chinese support will be added later.\n\n\n",
"description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/voidism/pywordseg", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "pywordseg", "package_url": "https://pypi.org/project/pywordseg/", "platform": "", "project_url": 
"https://pypi.org/project/pywordseg/", "project_urls": { "Homepage": "https://github.com/voidism/pywordseg" }, "release_url": "https://pypi.org/project/pywordseg/0.1.1/", "requires_dist": [ "torch", "h5py", "numpy", "overrides" ], "requires_python": "", "summary": "Open source state-of-the-art Chinese word segmentation toolkit", "version": "0.1.1" }, "last_serial": 4716040, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "580801467f38a375ce7c461a2a8ff150", "sha256": "6d8c60ef43493049abcd2de6273763fe8571a0c99be04aebfb1be4d324d822ae" }, "downloads": -1, "filename": "pywordseg-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "580801467f38a375ce7c461a2a8ff150", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6514, "upload_time": "2019-01-10T09:23:12", "url": "https://files.pythonhosted.org/packages/ec/60/2d95240425ea8ee4ecde49699680d858312778a948a629a503de8f856eb7/pywordseg-0.0.1-py3-none-any.whl" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "e6473f6daf42d4579f6a3fb1a0f9f65e", "sha256": "a367a21dcb46e0cd0e1268d4c8774cc85a665039d9d2050426e7e62b1e492529" }, "downloads": -1, "filename": "pywordseg-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "e6473f6daf42d4579f6a3fb1a0f9f65e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6514, "upload_time": "2019-01-10T09:27:40", "url": "https://files.pythonhosted.org/packages/ba/40/ac96c02fa89ed2841c9b813517b6fa38007cf79de7cc7985d9d3b4437419/pywordseg-0.0.2-py3-none-any.whl" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "f3019598112a0a13eada51370651962e", "sha256": "61feb8745ade47fa49a9fbbf0ccae3eac211192c8e6053241cd76eb95f5bc9a7" }, "downloads": -1, "filename": "pywordseg-0.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "f3019598112a0a13eada51370651962e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6515, "upload_time": 
"2019-01-10T09:33:48", "url": "https://files.pythonhosted.org/packages/02/73/0598adb67d961dcfeb31924ae0a95ecdde0b3a51f42968cd4a74e4188fce/pywordseg-0.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "519af4c57485b15d93628f86c67f782e", "sha256": "f2b0c44ce8d9bc37043de9732a88a7c447797901d84faf37e5309f355f805f67" }, "downloads": -1, "filename": "pywordseg-0.0.3.tar.gz", "has_sig": false, "md5_digest": "519af4c57485b15d93628f86c67f782e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6089, "upload_time": "2019-01-10T09:33:49", "url": "https://files.pythonhosted.org/packages/6a/a7/6c10019ccb559de3f76341f03b949ad9d0227071fe6693c4feb94ae5c194/pywordseg-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "896531745f8675354c1685d963f54590", "sha256": "b15e331a417ce046c798e5a37127bb36c082d590eba21abf2f906c0b20df1a6d" }, "downloads": -1, "filename": "pywordseg-0.0.4-py3-none-any.whl", "has_sig": false, "md5_digest": "896531745f8675354c1685d963f54590", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6519, "upload_time": "2019-01-10T09:46:31", "url": "https://files.pythonhosted.org/packages/bb/c6/5184360f6dfc3487bc23de2c51e148041865fcc93e3e376e3a48d059ebf2/pywordseg-0.0.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c2a8707d083645e76d2aac1e3b107a3d", "sha256": "94f01ea070a975751521ef6884a5b9e49926bf744d0319741cdf437bf571cf8d" }, "downloads": -1, "filename": "pywordseg-0.0.4.tar.gz", "has_sig": false, "md5_digest": "c2a8707d083645e76d2aac1e3b107a3d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6093, "upload_time": "2019-01-10T09:46:32", "url": "https://files.pythonhosted.org/packages/e7/60/c69a6d8b3ab4f34d1875d29e95f383002ab4a2dd0d5bfb778057ee67a51a/pywordseg-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "0207e029c9d4283980cf48dfaef9ad3f", "sha256": 
"0f5fc4c32b7e16233ae03caf97a3428d1863912e8591510c852ad55d8557fed0" }, "downloads": -1, "filename": "pywordseg-0.0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "0207e029c9d4283980cf48dfaef9ad3f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6525, "upload_time": "2019-01-10T09:49:14", "url": "https://files.pythonhosted.org/packages/09/e6/7663e02a0206561b83df37bc77757da0f940309c393e714928ef6b26bf2b/pywordseg-0.0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "69be8d700317f5ef43a8e77ea0b6f0e9", "sha256": "b0437ad6bdd3155b42e77dd1fdd682e0e26b84cb0b5944a87ba24387ee31efe0" }, "downloads": -1, "filename": "pywordseg-0.0.5.tar.gz", "has_sig": false, "md5_digest": "69be8d700317f5ef43a8e77ea0b6f0e9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6098, "upload_time": "2019-01-10T09:49:16", "url": "https://files.pythonhosted.org/packages/82/fc/6df222751908f8322eacc17a7cf788327f308f2d8e0f40a98e29c9b33136/pywordseg-0.0.5.tar.gz" } ], "0.0.6": [ { "comment_text": "", "digests": { "md5": "a3a3daa816771c6f7338775c0c0d6e27", "sha256": "428b055bda103180320ed0e637e1bd20518e1b0b84f9bc14bd4073a35eb2406b" }, "downloads": -1, "filename": "pywordseg-0.0.6-py3-none-any.whl", "has_sig": false, "md5_digest": "a3a3daa816771c6f7338775c0c0d6e27", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6543, "upload_time": "2019-01-10T10:13:24", "url": "https://files.pythonhosted.org/packages/35/2a/ac48ce976389f41e535e96f5db3043dbe9a3c2d6ee2524a26d07fd721468/pywordseg-0.0.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6cba5e59ef8934491d7d67d480bf637c", "sha256": "a1eee573acc54d41eecd59eb59507d6dcc27054f7ba060b20992e0c44b7bc4de" }, "downloads": -1, "filename": "pywordseg-0.0.6.tar.gz", "has_sig": false, "md5_digest": "6cba5e59ef8934491d7d67d480bf637c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6112, 
"upload_time": "2019-01-10T10:13:27", "url": "https://files.pythonhosted.org/packages/dc/91/cac3420e37850e2722f7747b45c3514ab4ccd2a4675ab4ff2ca71e285115/pywordseg-0.0.6.tar.gz" } ], "0.0.7": [ { "comment_text": "", "digests": { "md5": "29946fd4a266ea8a3be93e9f6af74b9f", "sha256": "73b9a7a1bf46652cdce2017fc3a27349eda087b665ea4a5e012d1b4358e4b528" }, "downloads": -1, "filename": "pywordseg-0.0.7-py3-none-any.whl", "has_sig": false, "md5_digest": "29946fd4a266ea8a3be93e9f6af74b9f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6219, "upload_time": "2019-01-10T10:45:50", "url": "https://files.pythonhosted.org/packages/2d/56/092f48ecff8eee6297cc3e222dbcbeb97559bcdd106acdaec9ecddaed694/pywordseg-0.0.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "dbf6e8114867c4ed8a0e2ed932912e56", "sha256": "0c2c3a2d36a3bc436996fa59898a428c78a42c65ad4e8062fe98a229d222ca7a" }, "downloads": -1, "filename": "pywordseg-0.0.7.tar.gz", "has_sig": false, "md5_digest": "dbf6e8114867c4ed8a0e2ed932912e56", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6117, "upload_time": "2019-01-10T10:45:51", "url": "https://files.pythonhosted.org/packages/2a/65/7c05e25c3cf9c7fffe928db6c22a4d62fbf905ee3cdaed99912d8d576751/pywordseg-0.0.7.tar.gz" } ], "0.0.8": [ { "comment_text": "", "digests": { "md5": "7b7bf4771366eda5794270dd9a357197", "sha256": "a47e99913306a228cea0e0542292c6a7de8668cdc385583a383f3bd57d0d0ffa" }, "downloads": -1, "filename": "pywordseg-0.0.8-py3-none-any.whl", "has_sig": false, "md5_digest": "7b7bf4771366eda5794270dd9a357197", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6230, "upload_time": "2019-01-10T14:43:45", "url": "https://files.pythonhosted.org/packages/08/47/9acad53290ef2aa166ee575ac2773d62e46cba03746679e01e04a3582687/pywordseg-0.0.8-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d6f055bb7d94f9ebce89b614e2880545", "sha256": 
"8ed5c597bcba13b5d7f86d1f49317f137c61233da13579ecda2816db19e3db88" }, "downloads": -1, "filename": "pywordseg-0.0.8.tar.gz", "has_sig": false, "md5_digest": "d6f055bb7d94f9ebce89b614e2880545", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6175, "upload_time": "2019-01-10T14:43:47", "url": "https://files.pythonhosted.org/packages/db/8c/73d61cd6798995cfa3f52b0e82bdc9cbad88e6ff133cf0f77ba7a7cad122/pywordseg-0.0.8.tar.gz" } ], "0.0.9": [ { "comment_text": "", "digests": { "md5": "1c8e1a8ab5cb5d2e35458578c2e8e22a", "sha256": "977a4e6f6e2b3a0d4849efaf09ad7a4ae0b2d21bed6984d221e2aef71d90f5b6" }, "downloads": -1, "filename": "pywordseg-0.0.9-py3-none-any.whl", "has_sig": false, "md5_digest": "1c8e1a8ab5cb5d2e35458578c2e8e22a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8098, "upload_time": "2019-01-11T15:55:12", "url": "https://files.pythonhosted.org/packages/ec/4f/adb250a7378b9fe5b15182c8262988c1d3ce41bf672fbc3a0d2cdf4e54ab/pywordseg-0.0.9-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c8a7c128677690f0581bf399b60d30f5", "sha256": "ebca499cc11e51da3e577768eb32458945d97450e4f30277f739747c60e3ee86" }, "downloads": -1, "filename": "pywordseg-0.0.9.tar.gz", "has_sig": false, "md5_digest": "c8a7c128677690f0581bf399b60d30f5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6645, "upload_time": "2019-01-11T15:55:14", "url": "https://files.pythonhosted.org/packages/45/3a/e3bfd01c5452df3c66929acac7211fff665a70e9982581170eef5ad7a047/pywordseg-0.0.9.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "e0f0aba1481f09f7cd7e595a4781b098", "sha256": "f97ef6ea0c8231ee86d2e9cd9098f6412a68e355f333058764c2106977a32e85" }, "downloads": -1, "filename": "pywordseg-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "e0f0aba1481f09f7cd7e595a4781b098", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 
8286, "upload_time": "2019-01-18T11:51:39", "url": "https://files.pythonhosted.org/packages/1c/f7/bbecf8a1c0aa36cfc83ecfaae2176c296e24a6aff34075775de0d12d3b74/pywordseg-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1179e3b5219ca4cef36683eef0994856", "sha256": "3e2d3ec0af567abce004bc073f61436117a6b477e175abd0a5333b9ea362f65a" }, "downloads": -1, "filename": "pywordseg-0.1.0.tar.gz", "has_sig": false, "md5_digest": "1179e3b5219ca4cef36683eef0994856", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6840, "upload_time": "2019-01-18T11:51:41", "url": "https://files.pythonhosted.org/packages/96/41/903fd729c961fed49e7f4418a800054a138c9a5c131c2d34a0740e3679ed/pywordseg-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "ab164bf29dc4800375add56b18996291", "sha256": "5f514e7c27473410681c0479bcb70b148e5b05f07d1b8e0ede3ac4b26393ad51" }, "downloads": -1, "filename": "pywordseg-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "ab164bf29dc4800375add56b18996291", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8396, "upload_time": "2019-01-19T16:08:16", "url": "https://files.pythonhosted.org/packages/1e/15/70fb05edefee2bad4607f086c601c1eda8e2d7657e8f6503266c92bf44d0/pywordseg-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3d8149cf06ae274b23ccd864c601752f", "sha256": "6174f5ef53659fb47d3838c3b65bb3263c6a84b0791c96907339279e5d22967b" }, "downloads": -1, "filename": "pywordseg-0.1.1.tar.gz", "has_sig": false, "md5_digest": "3d8149cf06ae274b23ccd864c601752f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6947, "upload_time": "2019-01-19T16:08:17", "url": "https://files.pythonhosted.org/packages/a7/a5/3dc4253a25535086182b86fab5cce70d583c882d930de061d880fc5156ea/pywordseg-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ab164bf29dc4800375add56b18996291", "sha256": 
"5f514e7c27473410681c0479bcb70b148e5b05f07d1b8e0ede3ac4b26393ad51" }, "downloads": -1, "filename": "pywordseg-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "ab164bf29dc4800375add56b18996291", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8396, "upload_time": "2019-01-19T16:08:16", "url": "https://files.pythonhosted.org/packages/1e/15/70fb05edefee2bad4607f086c601c1eda8e2d7657e8f6503266c92bf44d0/pywordseg-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3d8149cf06ae274b23ccd864c601752f", "sha256": "6174f5ef53659fb47d3838c3b65bb3263c6a84b0791c96907339279e5d22967b" }, "downloads": -1, "filename": "pywordseg-0.1.1.tar.gz", "has_sig": false, "md5_digest": "3d8149cf06ae274b23ccd864c601752f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6947, "upload_time": "2019-01-19T16:08:17", "url": "https://files.pythonhosted.org/packages/a7/a5/3dc4253a25535086182b86fab5cce70d583c882d930de061d880fc5156ea/pywordseg-0.1.1.tar.gz" } ] }