{ "info": { "author": "Hoi", "author_email": "hoiy927@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Berserker\nBerserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's [BERT](https://github.com/google-research/bert) model.\n\n## Installation\n```python\npip install basaka\n```\n\n## Usage\n```python\nimport berserker\n\nberserker.load_model() # An one-off installation\nberserker.tokenize('\u59d1\u59d1\u60f3\u904e\u904e\u904e\u5152\u904e\u904e\u7684\u751f\u6d3b\u3002') # ['\u59d1\u59d1', '\u60f3', '\u904e', '\u904e', '\u904e\u5152', '\u904e\u904e', '\u7684', '\u751f\u6d3b', '\u3002']\n```\n\n## Benchmark\nThe table below shows that Berserker achieved state-of-the-art F1 measure on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip).\n\nThe result below is trained with 15 epoches on each dataset with a batch size of 64.\n\n| | PKU | CITYU | MSR | AS |\n|--------------------|----------|----------|----------|----------|\n| Liu et al. (2016) | **96.8** | -- | 97.3 | -- |\n| Yang et al. (2017) | 96.3 | 96.9 | 97.5 | 95.7 |\n| Zhou et al. (2017) | 96.0 | -- | 97.8 | -- |\n| Cai et al. (2017) | 95.8 | 95.6 | 97.1 | -- |\n| Chen et al. (2017) | 94.3 | 95.6 | 96.0 | 94.6 |\n| Wang and Xu (2017) | 96.5 | -- | 98.0 | -- |\n| Ma et al. (2018) | 96.1 | **97.2** | 98.1 | 96.2 |\n|--------------------|----------|----------|----------|----------|\n| Berserker | 96.6 | 97.1 | **98.4** | **96.5** |\n\nReference: [Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://arxiv.org/pdf/1808.06511.pdf)\n\n## Limitation\nSince Berserker ~~is muscular~~ is based on BERT, it has a large model size (~300MB) and run slowly on CPU. Berserker is just a proof of concept on what could be achieved with BERT.\n\nCurrently the default model provided is trained with [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [PKU dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip). We plan to release more pretrained model in the future.\n\n## Architecture\nBerserker is fine-tuned over TPU with [pretrained Chinese BERT model](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip). It is connected with a single dense layer which is applied to all tokens to produce a sequence of [0, 1] output, where 1 denote a split.\n\n## Training\nWe provided the source code for training under the `trainer` subdirectory. Feel free to contact me if you need any help reproducing the result.\n\n## Bonus Video\n[\"Yachae!!](https://www.youtube.com/watch?v=H_xmyvABZnE)\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "basaka", "package_url": "https://pypi.org/project/basaka/", "platform": "", "project_url": "https://pypi.org/project/basaka/", "project_urls": null, "release_url": "https://pypi.org/project/basaka/0.2.1/", "requires_dist": [ "requests", "six", "tqdm", "tensorflow (>=1.12.0)" ], "requires_python": "", "summary": "Berserker - BERt chineSE woRd toKenizER", "version": "0.2.1" }, "last_serial": 4799503, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "22a6977ce5b7f5fa1bad1f736f4dedfc", "sha256": "858e2b38710bdb03ded8048742dfe442dcb428e609ce748b49f4c89db742e8b8" }, "downloads": -1, "filename": "basaka-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "22a6977ce5b7f5fa1bad1f736f4dedfc", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 65980, "upload_time": "2019-01-19T17:24:20", "url": "https://files.pythonhosted.org/packages/60/05/f9ccccd3484d7ebc86329b731d0f4d3f71bc0a1bb40d9ad2fd47bf7b99ce/basaka-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "76fd49a08f46e2293d2ee548160f2145", "sha256": "537da8e1da9be08d44a64f2b837777ef00efea4f1d38a15abc8c1c81e5daedd2" }, "downloads": -1, "filename": "basaka-0.0.1.tar.gz", "has_sig": false, "md5_digest": "76fd49a08f46e2293d2ee548160f2145", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 77159, "upload_time": "2019-01-19T17:24:23", "url": "https://files.pythonhosted.org/packages/e9/4a/2ba96ba2beb3ee2ce45dcfc9dfbf4f8ab41b298f6059c094067fc8865dd9/basaka-0.0.1.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "95c79ac232e69319e62489ba5e6f528b", "sha256": "04fb1365d3632de8f134a30ac10b4239dbb4380451a4717b45e82834d17abab7" }, "downloads": -1, "filename": "basaka-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "95c79ac232e69319e62489ba5e6f528b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 70620, "upload_time": "2019-02-09T14:59:46", "url": "https://files.pythonhosted.org/packages/1b/41/c6e5cfce944b020a277c9f3491c9e094dc15a565875af6155becf0f6f3ad/basaka-0.2.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "590c5e641358deb36cd353cd8ebcdfaa", "sha256": "00985705ea711950378335a98f69b929dc602e3b92fd698f7ab76d4215e9cd69" }, "downloads": -1, "filename": "basaka-0.2.1.tar.gz", "has_sig": false, "md5_digest": "590c5e641358deb36cd353cd8ebcdfaa", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 86487, "upload_time": "2019-02-09T14:59:49", "url": "https://files.pythonhosted.org/packages/4b/9e/84c6924c38e6871c1edf5191a825e5ca356a3cdcddaf91d71cb8e94a4d35/basaka-0.2.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "95c79ac232e69319e62489ba5e6f528b", "sha256": "04fb1365d3632de8f134a30ac10b4239dbb4380451a4717b45e82834d17abab7" }, "downloads": -1, "filename": "basaka-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "95c79ac232e69319e62489ba5e6f528b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 70620, "upload_time": "2019-02-09T14:59:46", "url": "https://files.pythonhosted.org/packages/1b/41/c6e5cfce944b020a277c9f3491c9e094dc15a565875af6155becf0f6f3ad/basaka-0.2.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "590c5e641358deb36cd353cd8ebcdfaa", "sha256": "00985705ea711950378335a98f69b929dc602e3b92fd698f7ab76d4215e9cd69" }, "downloads": -1, "filename": "basaka-0.2.1.tar.gz", "has_sig": false, "md5_digest": "590c5e641358deb36cd353cd8ebcdfaa", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 86487, "upload_time": "2019-02-09T14:59:49", "url": "https://files.pythonhosted.org/packages/4b/9e/84c6924c38e6871c1edf5191a825e5ca356a3cdcddaf91d71cb8e94a4d35/basaka-0.2.1.tar.gz" } ] }