{ "info": { "author": "Jiajie Yan", "author_email": "jiaeyan@gmail.com", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Text Processing", "Topic :: Utilities" ], "description": "# \u7532\u8a00Jiayan\n[![PyPI](https://img.shields.io/badge/pypi-v0.0.21-blue.svg)](https://pypi.org/project/jiayan/)\n\n[\u4e2d\u6587](#\u7b80\u4ecb) \n[English](#introduction) \n\n## \u7b80\u4ecb\n\u7532\u8a00\uff0c\u53d6\u300c\u7532\u9aa8\u6587\u8a00\u300d\u4e4b\u610f\uff0c\u662f\u4e00\u6b3e\u4e13\u6ce8\u4e8e\u53e4\u6c49\u8bed\u5904\u7406\u7684NLP\u5de5\u5177\u5305\u3002 \n\u76ee\u524d\u901a\u7528\u7684\u6c49\u8bedNLP\u5de5\u5177\u5747\u4ee5\u73b0\u4ee3\u6c49\u8bed\u4e3a\u6838\u5fc3\u8bed\u6599\uff0c\u5bf9\u53e4\u4ee3\u6c49\u8bed\u7684\u5904\u7406\u6548\u679c\u5f88\u5dee(\u8be6\u89c1[\u5206\u8bcd](#2))\u3002\u672c\u9879\u76ee\u7684\u521d\u8877\uff0c\u4fbf\u662f\u8f85\u52a9\u53e4\u6c49\u8bed\u4fe1\u606f\u5904\u7406\uff0c\u5e2e\u52a9\u6709\u5fd7\u4e8e\u6316\u6398\u53e4\u6587\u5316\u77ff\u85cf\u7684\u53e4\u6c49\u8bed\u5b66\u8005\u3001\u7231\u597d\u8005\u7b49\u66f4\u597d\u5730\u5206\u6790\u548c\u5229\u7528\u6587\u8a00\u8d44\u6599\uff0c\u4ece\u300c\u6587\u5316\u9057\u4ea7\u300d\u4e2d\u521b\u9020\u51fa\u300c\u6587\u5316\u65b0\u4ea7\u300d\u3002 \n\u5f53\u524d\u7248\u672c\u652f\u6301[\u8bcd\u5e93\u6784\u5efa](#1)\u3001[\u81ea\u52a8\u5206\u8bcd](#2)\u3001[\u8bcd\u6027\u6807\u6ce8](#3)\u3001[\u6587\u8a00\u53e5\u8bfb](#4)\u548c[\u6807\u70b9](#5)\u4e94\u9879\u529f\u80fd\uff0c\u66f4\u591a\u529f\u80fd\u6b63\u5728\u5f00\u53d1\u4e2d\u3002 \n \n## \u529f\u80fd \n* [__\u8bcd\u5e93\u6784\u5efa__](#1) \n * 
\u5229\u7528\u65e0\u76d1\u7763\u7684\u53cc[\u5b57\u5178\u6811](https://baike.baidu.com/item/Trie\u6811)\u3001[\u70b9\u4e92\u4fe1\u606f](https://www.jianshu.com/p/79de56cbb2c7)\u4ee5\u53ca\u5de6\u53f3\u90bb\u63a5[\u71b5](https://baike.baidu.com/item/\u4fe1\u606f\u71b5/7302318?fr=aladdin)\u8fdb\u884c\u6587\u8a00\u8bcd\u5e93\u81ea\u52a8\u6784\u5efa\u3002\n* [__\u5206\u8bcd__](#2) \n * \u5229\u7528\u65e0\u76d1\u7763\u3001\u65e0\u8bcd\u5178\u7684[N\u5143\u8bed\u6cd5](https://baike.baidu.com/item/n\u5143\u8bed\u6cd5)\u548c[\u9690\u9a6c\u5c14\u53ef\u592b\u6a21\u578b](https://baike.baidu.com/item/\u9690\u9a6c\u5c14\u53ef\u592b\u6a21\u578b)\u8fdb\u884c\u53e4\u6c49\u8bed\u81ea\u52a8\u5206\u8bcd\u3002\n * \u5229\u7528\u8bcd\u5e93\u6784\u5efa\u529f\u80fd\u4ea7\u751f\u7684\u6587\u8a00\u8bcd\u5178\uff0c\u57fa\u4e8e\u6709\u5411\u65e0\u73af\u8bcd\u56fe\u3001\u53e5\u5b50\u6700\u5927\u6982\u7387\u8def\u5f84\u548c\u52a8\u6001\u89c4\u5212\u7b97\u6cd5\u8fdb\u884c\u5206\u8bcd\u3002\n* [__\u8bcd\u6027\u6807\u6ce8__](#3) \n * \u57fa\u4e8e\u8bcd\u7684[\u6761\u4ef6\u968f\u673a\u573a](https://baike.baidu.com/item/\u6761\u4ef6\u968f\u673a\u573a)\u7684\u5e8f\u5217\u6807\u6ce8\uff0c\u8bcd\u6027\u8be6\u89c1[\u8bcd\u6027\u8868](jiayan/postagger/README.md)\u3002\n* [__\u65ad\u53e5__](#4)\n * \u57fa\u4e8e\u5b57\u7b26\u7684\u6761\u4ef6\u968f\u673a\u573a\u7684\u5e8f\u5217\u6807\u6ce8\uff0c\u5f15\u5165\u70b9\u4e92\u4fe1\u606f\u53ca[t-\u6d4b\u8bd5\u503c](https://baike.baidu.com/item/t\u68c0\u9a8c/9910799?fr=aladdin)\u4e3a\u7279\u5f81\uff0c\u5bf9\u6587\u8a00\u6bb5\u843d\u8fdb\u884c\u81ea\u52a8\u65ad\u53e5\u3002\n* [__\u6807\u70b9__](#5)\n * \u57fa\u4e8e\u5b57\u7b26\u7684\u5c42\u53e0\u5f0f\u6761\u4ef6\u968f\u673a\u573a\u7684\u5e8f\u5217\u6807\u6ce8\uff0c\u5728\u65ad\u53e5\u7684\u57fa\u7840\u4e0a\u5bf9\u6587\u8a00\u6bb5\u843d\u8fdb\u884c\u81ea\u52a8\u6807\u70b9\u3002\n* \u6587\u767d\u7ffb\u8bd1\n * 
\u5f00\u53d1\u4e2d\uff0c\u76ee\u524d\u5904\u4e8e\u6587\u767d\u5e73\u884c\u8bed\u6599\u6536\u96c6\u3001\u6e05\u6d17\u9636\u6bb5\u3002\n * \u57fa\u4e8e[\u53cc\u5411\u957f\u77ed\u65f6\u8bb0\u5fc6\u5faa\u73af\u7f51\u7edc](https://baike.baidu.com/item/\u957f\u77ed\u671f\u8bb0\u5fc6\u4eba\u5de5\u795e\u7ecf\u7f51\u7edc/17541107?fromtitle=LSTM&fromid=17541102&fr=aladdin)\u548c[\u6ce8\u610f\u529b\u673a\u5236](https://baike.baidu.com/item/\u6ce8\u610f\u529b\u673a\u5236)\u7684\u795e\u7ecf\u7f51\u7edc\u751f\u6210\u6a21\u578b\uff0c\u5bf9\u53e4\u6587\u8fdb\u884c\u81ea\u52a8\u7ffb\u8bd1\u3002\n* \u6ce8\u610f\uff1a\u53d7\u8bed\u6599\u5f71\u54cd\uff0c\u76ee\u524d\u4e0d\u652f\u6301\u7e41\u4f53\u3002\u5982\u9700\u5904\u7406\u7e41\u4f53\uff0c\u53ef\u5148\u7528[OpenCC](https://github.com/yichen0831/opencc-python)\u5c06\u8f93\u5165\u8f6c\u6362\u4e3a\u7b80\u4f53\uff0c\u518d\u5c06\u7ed3\u679c\u8f6c\u5316\u4e3a\u76f8\u5e94\u7e41\u4f53(\u5982\u6e2f\u6fb3\u53f0\u7b49)\u3002 \n\n## \u5b89\u88c5 \n $ pip install jiayan \n $ pip install https://github.com/kpu/kenlm/archive/master.zip\n\n## \u4f7f\u7528 \n\u4ee5\u4e0b\u5404\u6a21\u5757\u7684\u4f7f\u7528\u65b9\u6cd5\u5747\u6765\u81ea[examples.py](jiayan/examples.py)\u3002\n1. \u4e0b\u8f7d\u6a21\u578b\u5e76\u89e3\u538b\uff1a[\u767e\u5ea6\u7f51\u76d8](https://pan.baidu.com/s/1PXP0eSQWWcNmAb6lkuB5sw)\uff0c\u63d0\u53d6\u7801\uff1a`p0sc`\n * jiayan.klm\uff1a\u8bed\u8a00\u6a21\u578b\uff0c\u4e3b\u8981\u7528\u6765\u5206\u8bcd\uff0c\u4ee5\u53ca\u53e5\u8bfb\u6807\u70b9\u4efb\u52a1\u4e2d\u7684\u7279\u5f81\u63d0\u53d6\uff1b \n * pos_model\uff1aCRF\u8bcd\u6027\u6807\u6ce8\u6a21\u578b\uff1b\n * cut_model\uff1aCRF\u53e5\u8bfb\u6a21\u578b\uff1b\n * punc_model\uff1aCRF\u6807\u70b9\u6a21\u578b\uff1b\n * \u5e84\u5b50.txt\uff1a\u7528\u6765\u6d4b\u8bd5\u8bcd\u5e93\u6784\u5efa\u7684\u5e84\u5b50\u5168\u6587\u3002\n \n2. 
__\u8bcd\u5e93\u6784\u5efa__ \n ```\n from jiayan import PMIEntropyLexiconConstructor\n \n constructor = PMIEntropyLexiconConstructor()\n lexicon = constructor.construct_lexicon('\u5e84\u5b50.txt')\n constructor.save(lexicon, '\u5e84\u5b50\u8bcd\u5e93.csv')\n ```\n \n \u7ed3\u679c\uff1a \n ```\n Word,Frequency,PMI,R_Entropy,L_Entropy\n \u4e4b,2999,80,7.944909328101839,8.279435615456894\n \u800c,2089,80,7.354575005231323,8.615211168836439\n \u4e0d,1941,80,7.244331150611089,6.362131306822925\n ...\n \u5929\u4e0b,280,195.23602384978196,5.158574399464853,5.24731990592901\n \u5723\u4eba,111,150.0620531154239,4.622606551534004,4.6853474419338585\n \u4e07\u7269,94,377.59805590304126,4.5959107835319895,4.538837960294887\n \u5929\u5730,92,186.73504238078462,3.1492586603863617,4.894533538722486\n \u5b54\u5b50,80,176.2550051738876,4.284638190120882,2.4056390622295662\n \u5e84\u5b50,76,169.26227942514097,2.328252899085616,2.1920058354921066\n \u4ec1\u4e49,58,882.3468468468468,3.501609497059026,4.96900162987599\n \u8001\u8043,45,2281.2228260869565,2.384853500510039,2.4331958387289765\n ...\n ```\n3. __\u5206\u8bcd__ \n 1. 
\u5b57\u7b26\u7ea7\u9690\u9a6c\u5c14\u53ef\u592b\u6a21\u578b\u5206\u8bcd\uff0c\u6548\u679c\u7b26\u5408\u8bed\u611f\uff0c\u5efa\u8bae\u4f7f\u7528\uff0c\u9700\u52a0\u8f7d\u8bed\u8a00\u6a21\u578b `jiayan.klm`\n ```\n from jiayan import load_lm\n from jiayan import CharHMMTokenizer\n \n text = '\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\uff0c\u6697\u800c\u4e0d\u660e\uff0c\u90c1\u800c\u4e0d\u53d1\uff0c\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u3002'\n \n lm = load_lm('jiayan.klm')\n tokenizer = CharHMMTokenizer(lm)\n print(list(tokenizer.tokenize(text)))\n ```\n \u7ed3\u679c\uff1a \n `['\u662f', '\u6545', '\u5185\u5723\u5916\u738b', '\u4e4b', '\u9053', '\uff0c', '\u6697', '\u800c', '\u4e0d', '\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', '\u4e4b', '\u4eba', '\u5404', '\u4e3a', '\u5176', '\u6240', '\u6b32', '\u7109', '\u4ee5', '\u81ea', '\u4e3a', '\u65b9', '\u3002']` \n \n \u7531\u4e8e\u53e4\u6c49\u8bed\u6ca1\u6709\u516c\u5f00\u5206\u8bcd\u6570\u636e\uff0c\u65e0\u6cd5\u505a\u6548\u679c\u8bc4\u4f30\uff0c\u4f46\u6211\u4eec\u53ef\u4ee5\u901a\u8fc7\u4e0d\u540cNLP\u5de5\u5177\u5bf9\u76f8\u540c\u53e5\u5b50\u7684\u5904\u7406\u7ed3\u679c\u6765\u76f4\u89c2\u611f\u53d7\u672c\u9879\u76ee\u7684\u4f18\u52bf: \n \n \u8bd5\u6bd4\u8f83 [LTP](https://github.com/HIT-SCIR/ltp) (3.4.0) \u6a21\u578b\u5206\u8bcd\u7ed3\u679c\uff1a \n `['\u662f', '\u6545\u5185', '\u5723\u5916\u738b', '\u4e4b', '\u9053', '\uff0c', '\u6697\u800c\u4e0d\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', '\u4e4b', '\u4eba', '\u5404', '\u4e3a', '\u5176', '\u6240', '\u6b32', '\u7109\u4ee5\u81ea\u4e3a\u65b9', '\u3002']` \n \n \u518d\u8bd5\u6bd4\u8f83 [HanLP](http://hanlp.com) \u5206\u8bcd\u7ed3\u679c\uff1a \n `['\u662f\u6545', '\u5185', '\u5723', '\u5916', '\u738b\u4e4b\u9053', '\uff0c', '\u6697', '\u800c', '\u4e0d\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', 
'\u4e4b', '\u4eba', '\u5404\u4e3a\u5176\u6240\u6b32\u7109', '\u4ee5', '\u81ea\u4e3a', '\u65b9', '\u3002']` \n \n \u53ef\u89c1\u672c\u5de5\u5177\u5bf9\u53e4\u6c49\u8bed\u7684\u5206\u8bcd\u6548\u679c\u660e\u663e\u4f18\u4e8e\u901a\u7528\u6c49\u8bedNLP\u5de5\u5177\u3002 \n \n 2. \u8bcd\u7ea7\u6700\u5927\u6982\u7387\u8def\u5f84\u5206\u8bcd\uff0c\u57fa\u672c\u4ee5\u5b57\u4e3a\u5355\u4f4d\uff0c\u9897\u7c92\u5ea6\u8f83\u7c97\n ```\n from jiayan import WordNgramTokenizer\n \n text = '\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\uff0c\u6697\u800c\u4e0d\u660e\uff0c\u90c1\u800c\u4e0d\u53d1\uff0c\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u3002'\n tokenizer = WordNgramTokenizer()\n print(list(tokenizer.tokenize(text)))\n ```\n \u7ed3\u679c\uff1a \n `['\u662f', '\u6545', '\u5185', '\u5723', '\u5916', '\u738b', '\u4e4b', '\u9053', '\uff0c', '\u6697', '\u800c', '\u4e0d', '\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', '\u4e4b', '\u4eba', '\u5404', '\u4e3a', '\u5176', '\u6240', '\u6b32', '\u7109', '\u4ee5', '\u81ea', '\u4e3a', '\u65b9', '\u3002']` \n\n4. __\u8bcd\u6027\u6807\u6ce8__\n ```\n from jiayan import CRFPOSTagger\n \n words = ['\u5929\u4e0b', '\u5927\u4e71', '\uff0c', '\u8d24\u5723', '\u4e0d', '\u660e', '\uff0c', '\u9053\u5fb7', '\u4e0d', '\u4e00', '\uff0c', '\u5929\u4e0b', '\u591a', '\u5f97', '\u4e00', '\u5bdf', '\u7109', '\u4ee5', '\u81ea', '\u597d', '\u3002']\n \n postagger = CRFPOSTagger()\n postagger.load('pos_model')\n print(postagger.postag(words))\n ```\n \u7ed3\u679c\uff1a \n `['n', 'a', 'wp', 'n', 'd', 'a', 'wp', 'n', 'd', 'm', 'wp', 'n', 'a', 'u', 'm', 'v', 'r', 'p', 'r', 'a', 'wp']` \n\n5. 
__\u65ad\u53e5__\n ```\n from jiayan import load_lm\n from jiayan import CRFSentencizer\n \n text = '\u5929\u4e0b\u5927\u4e71\u8d24\u5723\u4e0d\u660e\u9053\u5fb7\u4e0d\u4e00\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d\u8b6c\u5982\u8033\u76ee\u7686\u6709\u6240\u660e\u4e0d\u80fd\u76f8\u901a\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f\u7686\u6709\u6240\u957f\u65f6\u6709\u6240\u7528\u867d\u7136\u4e0d\u8be5\u4e0d\u904d\u4e00\u4e4b\u58eb\u4e5f\u5224\u5929\u5730\u4e4b\u7f8e\u6790\u4e07\u7269\u4e4b\u7406\u5bdf\u53e4\u4eba\u4e4b\u5168\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e\u79f0\u795e\u4e4b\u5bb9\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\u6697\u800c\u4e0d\u660e\u90c1\u800c\u4e0d\u53d1\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u60b2\u592b\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd\u5fc5\u4e0d\u5408\u77e3\u540e\u4e16\u4e4b\u5b66\u8005\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf\u53e4\u4e4b\u5927\u4f53\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2'\n \n lm = load_lm('jiayan.klm')\n sentencizer = CRFSentencizer(lm)\n sentencizer.load('cut_model')\n print(sentencizer.sentencize(text))\n ```\n \u7ed3\u679c\uff1a \n `['\u5929\u4e0b\u5927\u4e71', '\u8d24\u5723\u4e0d\u660e', '\u9053\u5fb7\u4e0d\u4e00', '\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d', '\u8b6c\u5982\u8033\u76ee', '\u7686\u6709\u6240\u660e', '\u4e0d\u80fd\u76f8\u901a', '\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f', '\u7686\u6709\u6240\u957f', '\u65f6\u6709\u6240\u7528', '\u867d\u7136', '\u4e0d\u8be5\u4e0d\u904d', '\u4e00\u4e4b\u58eb\u4e5f', '\u5224\u5929\u5730\u4e4b\u7f8e', '\u6790\u4e07\u7269\u4e4b\u7406', '\u5bdf\u53e4\u4eba\u4e4b\u5168', '\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e', '\u79f0\u795e\u4e4b\u5bb9', '\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053', '\u6697\u800c\u4e0d\u660e', '\u90c1\u800c\u4e0d\u53d1', '\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9', '\u60b2\u592b', 
'\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd', '\u5fc5\u4e0d\u5408\u77e3', '\u540e\u4e16\u4e4b\u5b66\u8005', '\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf', '\u53e4\u4e4b\u5927\u4f53', '\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2']` \n\n6. __\u6807\u70b9__\n ```\n from jiayan import load_lm\n from jiayan import CRFPunctuator\n \n text = '\u5929\u4e0b\u5927\u4e71\u8d24\u5723\u4e0d\u660e\u9053\u5fb7\u4e0d\u4e00\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d\u8b6c\u5982\u8033\u76ee\u7686\u6709\u6240\u660e\u4e0d\u80fd\u76f8\u901a\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f\u7686\u6709\u6240\u957f\u65f6\u6709\u6240\u7528\u867d\u7136\u4e0d\u8be5\u4e0d\u904d\u4e00\u4e4b\u58eb\u4e5f\u5224\u5929\u5730\u4e4b\u7f8e\u6790\u4e07\u7269\u4e4b\u7406\u5bdf\u53e4\u4eba\u4e4b\u5168\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e\u79f0\u795e\u4e4b\u5bb9\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\u6697\u800c\u4e0d\u660e\u90c1\u800c\u4e0d\u53d1\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u60b2\u592b\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd\u5fc5\u4e0d\u5408\u77e3\u540e\u4e16\u4e4b\u5b66\u8005\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf\u53e4\u4e4b\u5927\u4f53\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2'\n \n lm = load_lm('jiayan.klm')\n punctuator = CRFPunctuator(lm, 'cut_model')\n punctuator.load('punc_model')\n print(punctuator.punctuate(text))\n ```\n \u7ed3\u679c\uff1a \n 
`\u5929\u4e0b\u5927\u4e71\uff0c\u8d24\u5723\u4e0d\u660e\uff0c\u9053\u5fb7\u4e0d\u4e00\uff0c\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d\uff0c\u8b6c\u5982\u8033\u76ee\uff0c\u7686\u6709\u6240\u660e\uff0c\u4e0d\u80fd\u76f8\u901a\uff0c\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f\uff0c\u7686\u6709\u6240\u957f\uff0c\u65f6\u6709\u6240\u7528\uff0c\u867d\u7136\uff0c\u4e0d\u8be5\u4e0d\u904d\uff0c\u4e00\u4e4b\u58eb\u4e5f\uff0c\u5224\u5929\u5730\u4e4b\u7f8e\uff0c\u6790\u4e07\u7269\u4e4b\u7406\uff0c\u5bdf\u53e4\u4eba\u4e4b\u5168\uff0c\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e\uff0c\u79f0\u795e\u4e4b\u5bb9\uff0c\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\uff0c\u6697\u800c\u4e0d\u660e\uff0c\u90c1\u800c\u4e0d\u53d1\uff0c\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\uff0c\u60b2\u592b\uff01\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd\uff0c\u5fc5\u4e0d\u5408\u77e3\uff0c\u540e\u4e16\u4e4b\u5b66\u8005\uff0c\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf\uff0c\u53e4\u4e4b\u5927\u4f53\uff0c\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2\u3002`\n\n\n## \u7248\u672c\n* v0.0.21\n * \u5c06\u5b89\u88c5\u8fc7\u7a0b\u5206\u4e3a\u4e24\u6b65\uff0c\u786e\u4fdd\u5f97\u5230\u6700\u65b0\u7684kenlm\u7248\u672c\u3002 \n* v0.0.2\n * \u589e\u52a0\u8bcd\u6027\u6807\u6ce8\u529f\u80fd\u3002\n* v0.0.1\n * \u8bcd\u5e93\u6784\u5efa\u3001\u81ea\u52a8\u5206\u8bcd\u3001\u6587\u8a00\u53e5\u8bfb\u3001\u6807\u70b9\u529f\u80fd\u5f00\u653e\u3002\n \n \n---\n\n## Introduction\nJiayan, which means Chinese characters engraved on oracle bones, is a professional Python NLP tool for Classical Chinese. \nPrevailing Chinese NLP tools are mainly trained on modern Chinese data, which leads to bad performance on Classical Chinese (See [__Tokenizing__](#6)). The purpose of this project is to assist Classical Chinese information processing. 
\nThe current version supports [lexicon construction](#6), [tokenizing](#7), [POS tagging](#8), [sentence segmentation](#9) and [automatic punctuation](#10); more features are in development. \n \n## Features \n* [__Lexicon Construction__](#6) \n * Builds a Classical Chinese lexicon fully unsupervised, using a [Trie](https://en.wikipedia.org/wiki/Trie), [PMI](https://en.wikipedia.org/wiki/Pointwise_mutual_information) (_pointwise mutual information_) and the [entropy](https://en.wikipedia.org/wiki/Entropy_\\(information_theory\\)) of the neighboring characters to the left and right of each candidate word. \n* [__Tokenizing__](#7) \n * Tokenizes Classical Chinese sentences with an unsupervised, dictionary-free approach based on an [N-gram](https://en.wikipedia.org/wiki/N-gram) language model and an [HMM](https://en.wikipedia.org/wiki/Hidden_Markov_model) (_Hidden Markov Model_). \n * Alternatively, tokenizes with the dictionary produced by lexicon construction, using a Directed Acyclic Word Graph, maximum-probability paths and [Dynamic Programming](https://en.wikipedia.org/wiki/Dynamic_programming). \n* [__POS Tagging__](#8) \n * Word-level sequence tagging with a [CRF](https://en.wikipedia.org/wiki/Conditional_random_field) (_Conditional Random Field_). See the POS tag categories [here](jiayan/postagger/README.md). \n* [__Sentence Segmentation__](#9)\n * Character-level sequence tagging with a CRF, introducing PMI and [T-test](https://en.wikipedia.org/wiki/Student%27s_t-test) values as features. \n* [__Punctuation__](#10)\n * Character-level sequence tagging with layered CRFs, which punctuates Classical Chinese texts based on the results of sentence segmentation. \n* Note: Due to the training data we used, traditional Chinese is not supported for now. If you need to process traditional text, please use [OpenCC](https://github.com/yichen0831/opencc-python) to convert the input to simplified Chinese first, then convert the results back to traditional. 
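\n\nThe lexicon-construction scoring described above (frequency, PMI and left/right neighboring-character entropy, i.e. the four CSV columns in the Usage section) can be sketched in plain Python. This is an illustrative toy, not Jiayan's actual `PMIEntropyLexiconConstructor`; the `score_candidates` helper, its restriction to two-character candidates, and the toy input string are all assumptions made for brevity:

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    # Shannon entropy (bits) of a neighbor-character distribution;
    # 0.0 when the candidate never has a neighbor on this side.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total, 2) for c in counts.values())

def score_candidates(text, min_count=2):
    # Score every recurring two-character candidate word by
    # (frequency, PMI, right-neighbor entropy, left-neighbor entropy).
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    lefts, rights = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(text) - 1):
        bg = text[i:i + 2]
        if bigrams[bg] < min_count:
            continue
        if i > 0:
            lefts[bg][text[i - 1]] += 1
        if i + 2 < len(text):
            rights[bg][text[i + 2]] += 1
    n_uni, n_bi = len(text), len(text) - 1
    scores = {}
    for bg, freq in bigrams.items():
        if freq < min_count:
            continue
        # PMI: log-ratio of the bigram's probability to the product of its
        # characters' probabilities; high PMI means strong internal cohesion.
        pmi = math.log((freq / n_bi) /
                       ((unigrams[bg[0]] / n_uni) * (unigrams[bg[1]] / n_uni)), 2)
        scores[bg] = (freq, pmi, entropy(rights[bg]), entropy(lefts[bg]))
    return scores

print(score_candidates("XYaXYbXYc"))
```

In the toy string `"XYaXYbXYc"` only `XY` recurs: it gets a positive PMI (its characters always co-occur) and balanced neighbor entropies (it is followed by `a`, `b`, `c` and preceded by `a`, `b`), which is exactly the profile of a good word candidate.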
\n\n## Installation \n    $ pip install jiayan \n    $ pip install https://github.com/kpu/kenlm/archive/master.zip\n\n## Usage \nThe usage examples below all come from [examples.py](jiayan/examples.py). \n1. Download the models and unzip them: [Google Drive](https://drive.google.com/open?id=1piZQBO8OXQ5Cpi17vAcZsrbJLPABnKzp)\n   * jiayan.klm: the language model used for tokenizing and for feature extraction in sentence segmentation and punctuation; \n   * pos_model: the CRF model for POS tagging;\n   * cut_model: the CRF model for sentence segmentation;\n   * punc_model: the CRF model for punctuation; \n   * \u5e84\u5b50.txt: the full text of \u300aZhuangzi\u300b, used for testing lexicon construction. \n   \n2. __Lexicon Construction__ \n   ```\n   from jiayan import PMIEntropyLexiconConstructor\n   \n   constructor = PMIEntropyLexiconConstructor()\n   lexicon = constructor.construct_lexicon('\u5e84\u5b50.txt')\n   constructor.save(lexicon, 'Zhuangzi_Lexicon.csv')\n   ```\n   \n   Result: \n   ```\n   Word,Frequency,PMI,R_Entropy,L_Entropy\n   \u4e4b,2999,80,7.944909328101839,8.279435615456894\n   \u800c,2089,80,7.354575005231323,8.615211168836439\n   \u4e0d,1941,80,7.244331150611089,6.362131306822925\n   ...\n   \u5929\u4e0b,280,195.23602384978196,5.158574399464853,5.24731990592901\n   \u5723\u4eba,111,150.0620531154239,4.622606551534004,4.6853474419338585\n   \u4e07\u7269,94,377.59805590304126,4.5959107835319895,4.538837960294887\n   \u5929\u5730,92,186.73504238078462,3.1492586603863617,4.894533538722486\n   \u5b54\u5b50,80,176.2550051738876,4.284638190120882,2.4056390622295662\n   \u5e84\u5b50,76,169.26227942514097,2.328252899085616,2.1920058354921066\n   \u4ec1\u4e49,58,882.3468468468468,3.501609497059026,4.96900162987599\n   \u8001\u8043,45,2281.2228260869565,2.384853500510039,2.4331958387289765\n   ...\n   ```\n3. __Tokenizing__ \n   1. 
The character-based HMM tokenizer is recommended, as its output best matches linguistic intuition; it requires the language model `jiayan.klm`\n      ```\n      from jiayan import load_lm\n      from jiayan import CharHMMTokenizer\n      \n      text = '\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\uff0c\u6697\u800c\u4e0d\u660e\uff0c\u90c1\u800c\u4e0d\u53d1\uff0c\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u3002'\n      \n      lm = load_lm('jiayan.klm')\n      tokenizer = CharHMMTokenizer(lm)\n      print(list(tokenizer.tokenize(text)))\n      ```\n      Result: \n      `['\u662f', '\u6545', '\u5185\u5723\u5916\u738b', '\u4e4b', '\u9053', '\uff0c', '\u6697', '\u800c', '\u4e0d', '\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', '\u4e4b', '\u4eba', '\u5404', '\u4e3a', '\u5176', '\u6240', '\u6b32', '\u7109', '\u4ee5', '\u81ea', '\u4e3a', '\u65b9', '\u3002']` \n      \n      Since there is no public tokenizing dataset for Classical Chinese, direct performance evaluation is hard; however, we can compare results on the same sentence with popular modern Chinese NLP tools to get an intuitive sense of Jiayan's advantage: \n      \n      Compare the tokenizing result of [LTP](https://github.com/HIT-SCIR/ltp) (3.4.0): \n      `['\u662f', '\u6545\u5185', '\u5723\u5916\u738b', '\u4e4b', '\u9053', '\uff0c', '\u6697\u800c\u4e0d\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', '\u4e4b', '\u4eba', '\u5404', '\u4e3a', '\u5176', '\u6240', '\u6b32', '\u7109\u4ee5\u81ea\u4e3a\u65b9', '\u3002']` \n      \n      Also, compare the tokenizing result of [HanLP](http://hanlp.com): \n      `['\u662f\u6545', '\u5185', '\u5723', '\u5916', '\u738b\u4e4b\u9053', '\uff0c', '\u6697', '\u800c', '\u4e0d\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', '\u4e4b', '\u4eba', '\u5404\u4e3a\u5176\u6240\u6b32\u7109', '\u4ee5', '\u81ea\u4e3a', '\u65b9', '\u3002']` \n      \n      On examples like this, Jiayan tokenizes Classical Chinese markedly better than general-purpose Chinese NLP tools. \n      \n   2. 
Word-level maximum-probability-path tokenizing; it mostly splits character by character, so the granularity is coarse\n      ```\n      from jiayan import WordNgramTokenizer\n      \n      text = '\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\uff0c\u6697\u800c\u4e0d\u660e\uff0c\u90c1\u800c\u4e0d\u53d1\uff0c\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u3002'\n      tokenizer = WordNgramTokenizer()\n      print(list(tokenizer.tokenize(text)))\n      ```\n      Result: \n      `['\u662f', '\u6545', '\u5185', '\u5723', '\u5916', '\u738b', '\u4e4b', '\u9053', '\uff0c', '\u6697', '\u800c', '\u4e0d', '\u660e', '\uff0c', '\u90c1', '\u800c', '\u4e0d', '\u53d1', '\uff0c', '\u5929\u4e0b', '\u4e4b', '\u4eba', '\u5404', '\u4e3a', '\u5176', '\u6240', '\u6b32', '\u7109', '\u4ee5', '\u81ea', '\u4e3a', '\u65b9', '\u3002']` \n\n4. __POS Tagging__\n   ```\n   from jiayan import CRFPOSTagger\n   \n   words = ['\u5929\u4e0b', '\u5927\u4e71', '\uff0c', '\u8d24\u5723', '\u4e0d', '\u660e', '\uff0c', '\u9053\u5fb7', '\u4e0d', '\u4e00', '\uff0c', '\u5929\u4e0b', '\u591a', '\u5f97', '\u4e00', '\u5bdf', '\u7109', '\u4ee5', '\u81ea', '\u597d', '\u3002']\n   \n   postagger = CRFPOSTagger()\n   postagger.load('pos_model')\n   print(postagger.postag(words))\n   ```\n   Result: \n   `['n', 'a', 'wp', 'n', 'd', 'a', 'wp', 'n', 'd', 'm', 'wp', 'n', 'a', 'u', 'm', 'v', 'r', 'p', 'r', 'a', 'wp']` \n\n5. 
__Sentence Segmentation__\n ```\n from jiayan import load_lm\n from jiayan import CRFSentencizer\n \n text = '\u5929\u4e0b\u5927\u4e71\u8d24\u5723\u4e0d\u660e\u9053\u5fb7\u4e0d\u4e00\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d\u8b6c\u5982\u8033\u76ee\u7686\u6709\u6240\u660e\u4e0d\u80fd\u76f8\u901a\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f\u7686\u6709\u6240\u957f\u65f6\u6709\u6240\u7528\u867d\u7136\u4e0d\u8be5\u4e0d\u904d\u4e00\u4e4b\u58eb\u4e5f\u5224\u5929\u5730\u4e4b\u7f8e\u6790\u4e07\u7269\u4e4b\u7406\u5bdf\u53e4\u4eba\u4e4b\u5168\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e\u79f0\u795e\u4e4b\u5bb9\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\u6697\u800c\u4e0d\u660e\u90c1\u800c\u4e0d\u53d1\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u60b2\u592b\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd\u5fc5\u4e0d\u5408\u77e3\u540e\u4e16\u4e4b\u5b66\u8005\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf\u53e4\u4e4b\u5927\u4f53\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2'\n \n lm = load_lm('jiayan.klm')\n sentencizer = CRFSentencizer(lm)\n sentencizer.load('cut_model')\n print(sentencizer.sentencize(text))\n ```\n Result: \n `['\u5929\u4e0b\u5927\u4e71', '\u8d24\u5723\u4e0d\u660e', '\u9053\u5fb7\u4e0d\u4e00', '\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d', '\u8b6c\u5982\u8033\u76ee', '\u7686\u6709\u6240\u660e', '\u4e0d\u80fd\u76f8\u901a', '\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f', '\u7686\u6709\u6240\u957f', '\u65f6\u6709\u6240\u7528', '\u867d\u7136', '\u4e0d\u8be5\u4e0d\u904d', '\u4e00\u4e4b\u58eb\u4e5f', '\u5224\u5929\u5730\u4e4b\u7f8e', '\u6790\u4e07\u7269\u4e4b\u7406', '\u5bdf\u53e4\u4eba\u4e4b\u5168', '\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e', '\u79f0\u795e\u4e4b\u5bb9', '\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053', '\u6697\u800c\u4e0d\u660e', '\u90c1\u800c\u4e0d\u53d1', '\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9', '\u60b2\u592b', 
'\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd', '\u5fc5\u4e0d\u5408\u77e3', '\u540e\u4e16\u4e4b\u5b66\u8005', '\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf', '\u53e4\u4e4b\u5927\u4f53', '\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2']` \n\n5. __Punctuation__\n ```\n from jiayan import load_lm\n from jiayan import CRFPunctuator\n \n text = '\u5929\u4e0b\u5927\u4e71\u8d24\u5723\u4e0d\u660e\u9053\u5fb7\u4e0d\u4e00\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d\u8b6c\u5982\u8033\u76ee\u7686\u6709\u6240\u660e\u4e0d\u80fd\u76f8\u901a\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f\u7686\u6709\u6240\u957f\u65f6\u6709\u6240\u7528\u867d\u7136\u4e0d\u8be5\u4e0d\u904d\u4e00\u4e4b\u58eb\u4e5f\u5224\u5929\u5730\u4e4b\u7f8e\u6790\u4e07\u7269\u4e4b\u7406\u5bdf\u53e4\u4eba\u4e4b\u5168\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e\u79f0\u795e\u4e4b\u5bb9\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\u6697\u800c\u4e0d\u660e\u90c1\u800c\u4e0d\u53d1\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\u60b2\u592b\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd\u5fc5\u4e0d\u5408\u77e3\u540e\u4e16\u4e4b\u5b66\u8005\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf\u53e4\u4e4b\u5927\u4f53\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2'\n \n lm = load_lm('jiayan.klm')\n punctuator = CRFPunctuator(lm, 'cut_model')\n punctuator.load('punc_model')\n print(punctuator.punctuate(text))\n ```\n Result: \n 
`\u5929\u4e0b\u5927\u4e71\uff0c\u8d24\u5723\u4e0d\u660e\uff0c\u9053\u5fb7\u4e0d\u4e00\uff0c\u5929\u4e0b\u591a\u5f97\u4e00\u5bdf\u7109\u4ee5\u81ea\u597d\uff0c\u8b6c\u5982\u8033\u76ee\uff0c\u7686\u6709\u6240\u660e\uff0c\u4e0d\u80fd\u76f8\u901a\uff0c\u72b9\u767e\u5bb6\u4f17\u6280\u4e5f\uff0c\u7686\u6709\u6240\u957f\uff0c\u65f6\u6709\u6240\u7528\uff0c\u867d\u7136\uff0c\u4e0d\u8be5\u4e0d\u904d\uff0c\u4e00\u4e4b\u58eb\u4e5f\uff0c\u5224\u5929\u5730\u4e4b\u7f8e\uff0c\u6790\u4e07\u7269\u4e4b\u7406\uff0c\u5bdf\u53e4\u4eba\u4e4b\u5168\uff0c\u5be1\u80fd\u5907\u4e8e\u5929\u5730\u4e4b\u7f8e\uff0c\u79f0\u795e\u4e4b\u5bb9\uff0c\u662f\u6545\u5185\u5723\u5916\u738b\u4e4b\u9053\uff0c\u6697\u800c\u4e0d\u660e\uff0c\u90c1\u800c\u4e0d\u53d1\uff0c\u5929\u4e0b\u4e4b\u4eba\u5404\u4e3a\u5176\u6240\u6b32\u7109\u4ee5\u81ea\u4e3a\u65b9\uff0c\u60b2\u592b\uff01\u767e\u5bb6\u5f80\u800c\u4e0d\u53cd\uff0c\u5fc5\u4e0d\u5408\u77e3\uff0c\u540e\u4e16\u4e4b\u5b66\u8005\uff0c\u4e0d\u5e78\u4e0d\u89c1\u5929\u5730\u4e4b\u7eaf\uff0c\u53e4\u4e4b\u5927\u4f53\uff0c\u9053\u672f\u5c06\u4e3a\u5929\u4e0b\u88c2\u3002`\n\n\n## Versions\n* v0.0.21\n * Divide the installation into two steps to ensure to get the latest version of kenlm. 
\n* v0.0.2\n * POS tagging feature is open.\n* v0.0.1\n * Add features of lexicon construction, tokenizing, sentence segmentation and automatic punctuation.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jiaeyan/Jiayan", "keywords": "classical-chinese,ancient-chinese,nlp", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "jiayan", "package_url": "https://pypi.org/project/jiayan/", "platform": "", "project_url": "https://pypi.org/project/jiayan/", "project_urls": { "Homepage": "https://github.com/jiaeyan/Jiayan" }, "release_url": "https://pypi.org/project/jiayan/0.0.21/", "requires_dist": null, "requires_python": ">=2.6, >=3", "summary": "The NLP toolkit designed for classical chinese.", "version": "0.0.21" }, "last_serial": 5833477, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "3b619cbfc400c98bd10e5c70f8d9753a", "sha256": "1e55427c353b9a5075954e14cf6e937fdbad1b0867c8107b95200bfafef0af31" }, "downloads": -1, "filename": "jiayan-0.0.1-py3.6.egg", "has_sig": false, "md5_digest": "3b619cbfc400c98bd10e5c70f8d9753a", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": ">=2.6, >=3", "size": 248468, "upload_time": "2019-09-11T05:36:22", "url": "https://files.pythonhosted.org/packages/c9/39/cc1dfa934d0fa4c8146caf854a3836a01ac4be497f986adc0004f2d87c71/jiayan-0.0.1-py3.6.egg" }, { "comment_text": "", "digests": { "md5": "7287e919f9b701d160c45f23618438ea", "sha256": "2e7959944fbbdfdb42646ac88b0e17cbde7fec10552ebb619f3ba2ba5c9a368d" }, "downloads": -1, "filename": "jiayan-0.0.1.tar.gz", "has_sig": false, "md5_digest": "7287e919f9b701d160c45f23618438ea", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, >=3", "size": 211778, "upload_time": "2019-08-21T05:56:17", "url": 
"https://files.pythonhosted.org/packages/26/34/046a8161fc4ada0dedb3f246541d6ce9a4f0e80abf5fdd2b9bcf8eaf955f/jiayan-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "20a0384f015e82ba8b742f012c7253a5", "sha256": "48ad776bd201b9797e8b8b6b60ed3c570d3064809e3050a84648fc14143f166e" }, "downloads": -1, "filename": "jiayan-0.0.2.tar.gz", "has_sig": false, "md5_digest": "20a0384f015e82ba8b742f012c7253a5", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, >=3", "size": 217104, "upload_time": "2019-09-11T05:36:25", "url": "https://files.pythonhosted.org/packages/34/9c/f994212663af76607a138c92470a2f3e09df2d0deda66f3fa1011476474d/jiayan-0.0.2.tar.gz" } ], "0.0.21": [ { "comment_text": "", "digests": { "md5": "748d1ccc1b7f569377936fd1d40b6279", "sha256": "c061077866d02e0bcb9b25cda65f34859e304eec94ad2e26d9e42088319d677b" }, "downloads": -1, "filename": "jiayan-0.0.21.tar.gz", "has_sig": false, "md5_digest": "748d1ccc1b7f569377936fd1d40b6279", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, >=3", "size": 217293, "upload_time": "2019-09-16T01:20:11", "url": "https://files.pythonhosted.org/packages/af/db/b49a4b0edc6be59b3fafb4c3bc953da039cd21959493ab7557e7a904c935/jiayan-0.0.21.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "748d1ccc1b7f569377936fd1d40b6279", "sha256": "c061077866d02e0bcb9b25cda65f34859e304eec94ad2e26d9e42088319d677b" }, "downloads": -1, "filename": "jiayan-0.0.21.tar.gz", "has_sig": false, "md5_digest": "748d1ccc1b7f569377936fd1d40b6279", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.6, >=3", "size": 217293, "upload_time": "2019-09-16T01:20:11", "url": "https://files.pythonhosted.org/packages/af/db/b49a4b0edc6be59b3fafb4c3bc953da039cd21959493ab7557e7a904c935/jiayan-0.0.21.tar.gz" } ] }