{ "info": { "author": "XuMing", "author_email": "xuming624@qq.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Natural Language :: Chinese (Simplified)", "Natural Language :: Chinese (Traditional)", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Text Processing", "Topic :: Text Processing :: Indexing", "Topic :: Text Processing :: Linguistic" ], "description": "![alt text](docs/logo.svg)\n\n[![PyPI version](https://badge.fury.io/py/pycorrector.svg)](https://badge.fury.io/py/pycorrector)\n[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/shibing624/pycorrector/LICENSE)\n![Language](https://img.shields.io/badge/Language-Python-blue.svg)\n![Python3](https://img.shields.io/badge/Python-3.X-red.svg)\n\n\n# pycorrector\n\nA Chinese text error correction tool. It corrects phonetically similar and visually similar wrong characters (or variant characters) and can be used to correct errors from Chinese pinyin and stroke input methods. Developed in Python 3.\n\n**pycorrector** locates wrong characters with a language model and corrects them using pinyin phonetic-similarity features, stroke and Wubi edit-distance features, and language-model perplexity features.\n\n## Problem\n\nCommon error types in the Chinese text correction task include:\n\n- Homophone characters and words, e.g. \u914d\u526f\u773c\u775b-\u914d\u526f\u773c\u955c\n- Characters and words with confusable pronunciation, e.g. 
\u6d41\u6d6a\u7ec7\u5973-\u725b\u90ce\u7ec7\u5973\n- Reversed character or word order, e.g. \u4f0d\u8fea\u827e\u4f26-\u827e\u4f26\u4f0d\u8fea\n- Word completion, e.g. \u7231\u6709\u5929\u610f-\u5047\u5982\u7231\u6709\u5929\u610f\n- Visually similar character errors, e.g. \u9ad8\u6881-\u9ad8\u7cb1\n- Full pinyin spelling, e.g. xingfu-\u5e78\u798f\n- Pinyin abbreviation, e.g. sz-\u6df1\u5733\n- Grammatical errors, e.g. \u60f3\u8c61\u96be\u4ee5-\u96be\u4ee5\u60f3\u8c61\n\nNot all of these problems arise in every application. For example, an input method needs to handle the first four types, a search engine needs to handle all of them, and correction of speech-recognition output only needs to handle the first two.\nVisually similar character errors mainly concern Wubi or stroke-based handwriting input.\n\n\n## Solution\n### Rule-based approach\n1. Chinese error correction proceeds in two steps: error detection, then error correction.\n2. Error detection first segments the sentence with the jieba Chinese word segmenter. Because the sentence contains wrong characters, the segmentation often contains errors as well, so errors are detected at both character granularity and word granularity.\nThe suspected errors from both granularities are merged into a candidate set of suspected error positions.\n3. 
Error correction iterates over all suspected error positions, replaces the word at each position with candidates from the phonetically similar and visually similar dictionaries, computes the sentence perplexity with the language model, and then compares and ranks all candidates to obtain the best correction.\n\n### Deep-model approach\n1. End-to-end deep models avoid manual feature extraction and reduce manual effort, and RNN sequence models fit text tasks well. rnn_attention took first place in an English text correction competition, which suggests it works well in practice.\n2. A CRF computes the conditional probability of the globally optimal output nodes, so detecting a specific error type in a sentence takes the whole sentence into account. Alibaba used it in the 2016 Chinese grammatical error correction task and took first place, which suggests it works well in practice.\n3. 
The seq2seq model uses an encoder-decoder structure to solve sequence transduction problems. It is currently one of the most widely used and best-performing models for sequence transduction tasks such as machine translation, dialogue generation, text summarization, and image captioning.\n\n\n## Features\n### Models\n* kenlm: the kenlm statistical language model toolkit\n* rnn_lm: a stacked bidirectional LSTM language model, implemented in both TensorFlow and PaddlePaddle\n* rnn_attention model: based on Stanford University's nlc model, the method that took first place in a 2014 English text correction competition\n* rnn_crf model: based on the method with which Alibaba took first place in a 2016 Chinese grammatical error correction competition\n* seq2seq model: applies a sequence model to text correction; one of the commonly used models for grammatical error correction\n* seq2seq_attention model: adds an attention mechanism to the seq2seq model; it works better on long text and converges more easily, but tends to overfit\n* transformer model: an all-attention structure that replaces the LSTM for sequence-to-sequence problems and extracts semantic features better\n* bert model: a Chinese fine-tuned model that corrects wrong characters using the MASK feature\n\n\n### Error detection\n* 
Character granularity: using language-model perplexity (ppl), if the likelihood of a character is below the average for the sentence, that character is likely to be a wrong character.\n* Word granularity: a word that is not in the dictionary after segmentation is likely to be a wrong word.\n\n\n### Error correction\n* After error detection has located all suspected errors, collect the phonetically similar and visually similar candidates for every suspected wrong character.\n* Substitute each candidate and rank the results with the language model, similar to candidate ranking in a translation model, to obtain the best correction.\n\n\n### Reflections\n1. The current approach has decent error recall at word granularity, but correction accuracy still needs to improve. More high-quality correction sets and correction lexicons would help, but I hope even more for a larger algorithmic breakthrough.\n2. 
In addition, text errors today are no longer limited to spelling errors at character and word granularity. Chinese grammatical error detection (CGED, Chinese Grammar Error Diagnosis) and correction need to improve; this is on the TODO list for later investigation.\n\n## demo\n\nhttps://www.borntowin.cn/product/corrector\n\n\n## Installation\n* Fully automatic installation: pip3 install pycorrector\n* Semi-automatic installation:\n```\ngit clone https://github.com/shibing624/pycorrector.git\ncd pycorrector\npython3 setup.py install\n```\n\n## Using the rule-based approach\n\n\n### Install dependencies\n```\npip3 install -r requirements.txt\n```\n\n### Correction \nUsage example:\n```\nimport pycorrector\n\ncorrected_sent, detail = pycorrector.correct('\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750')\nprint(corrected_sent, detail)\n\n```\n\nOutput:\n```\n\u5c11\u5148\u961f\u5458\u5e94\u8be5\u4e3a\u8001\u4eba\u8ba9\u5ea7 [[('\u56e0\u8be5', '\u5e94\u8be5', 4, 6)], [('\u5750', '\u5ea7', 10, 11)]]\n```\n\n\n## Using the deep-model approach\n\n### Install dependencies\n```\npip3 install -r requirements-dev.txt\n\npip3 install git+https://www.github.com/keras-team/keras-contrib.git\n```\n\n### 
Introduction\n\nOne of the original aims of this project is to compare and share different text correction methods and to serve as a starting point for discussion. If it gives anyone even a little inspiration for text correction tasks, I would be greatly honored.\n\nSeveral deep models are applied to the text correction task: the `rnn_attention`, `rnn_crf`, `conv_seq2seq`, `seq2seq_attention`,\n`transformer`, and `bert` models introduced in the Models (`\u6a21\u578b`) subsection above. Each model lives under the `pycorrector` folder with a detailed `README.md`, and each can be run independently with no dependencies on the others.\n\n\n### Usage\nEach model can preprocess data, train, and predict independently. The following uses `seq2seq_attention` as an example:\n\nseq2seq_attention model usage example:\n\n#### Configuration\n\nEdit `config.py`.\n\n\n#### Data preprocessing\n```\ncd seq2seq_attention\n# Data preprocessing\npython 
preprocess.py\n\n```\nThis automatically creates an `output` folder and generates `train.txt` and `test.txt` in it. Each line holds the erroneous text and the corrected text separated by a TAB (\"\\t\"). Example file content:\n\n```\n\u636e\u79d1\u5b66\u7406\u8bba\uff0c\u88ab\u52a8\u5438\u70df\u8005\u7684\u5371\u5bb3\u6bd4\u5438\u70df\u8005\u66f4\u5389\u5bb3\u3002\t\u636e\u79d1\u5b66\u7406\u8bba\uff0c\u88ab\u52a8\u5438\u70df\u8005\u53d7\u5230\u7684\u5371\u5bb3\u6bd4\u5438\u70df\u8005\u66f4\u5389\u5bb3\u3002\n\u5e0c\u671b\u5c11\u5438\u70df\u3002\t\u5e0c\u671b\u70df\u6c11\u4eec\u5c11\u5438\u70df\u3002\n\u4f46\u5176\u5b9e\u7981\u70df\u8fd9\u79cd\u4e8b\u60c5\u975e\u5e38\u96be\u3002\t\u4f46\u7981\u70df\u8fd9\u79cd\u4e8b\u60c5\u5176\u5b9e\u975e\u5e38\u96be\u3002\n```\n\n\n#### Training\n```\npython train.py\n```\nScreenshot of the training process:\n![train image](https://github.com/shibing624/pycorrector/blob/master/pycorrector/data/git_image/seq2seq_train.png)\n\n\n#### Prediction\n```\npython infer.py\n```\n\n\nSample prediction output:\n```\ninput: \u5c11\u5148\u961f\u5458\u56e0\u8be5\u7ed9\u8001\u4eba\u8ba9\u5750 output: \u5c11\u5148\u961f\u5458\u56e0\u8be5\u7ed9\u8001\u4eba\u8ba9\u5ea7\ninput: \u5c11\u5148\u961f\u5458\u5e94\u8be5\u7ed9\u8001\u4eba\u8ba9\u5750 output: \u5c11\u5148\u961f\u5458\u5e94\u8be5\u7ed9\u8001\u4eba\u8ba9\u5ea7\ninput: \u6ca1\u6709\u89e3\u51b3\u8fd9\u4e2a\u95ee\u9898\uff0c output: \u6ca1\u6709\u89e3\u51b3\u8fd9\u4e2a\u95ee\u9898\uff0c\uff0c\ninput: \u7531\u6211\u8d77\u5f00\u59cb\u505a\u3002 output: \u7531\u6211\u8d77\u5f00\u59cb\u505a\ninput: \u7531\u6211\u8d77\u5f00\u59cb\u505a output: \u7531\u6211\u5f00\u59cb\u505a\n\n```\n\n\n## 
Custom language model\n\nThe language model is critical to the correction step. The corpus I have been able to collect so far is the People's Daily data. You can train a better language model on a larger corpus such as Chinese Wikipedia (converted from Traditional to Simplified Chinese; pycorrector.utils provides this conversion),\nwhich should noticeably improve correction quality.\n\n1. For how to use the kenlm language model training tool, see this blog post: http://blog.csdn.net/mingzai624/article/details/79560063\n2. The training corpus <People's Daily 2014 processed corpus> is provided, including:\n 1) the standard manually segmented and part-of-speech tagged data people2014.tar.gz,\n 2) the unsegmented text data people2014_words.txt,\n 3) the character-level language model file trained with kenlm and its binary file people2014corpus_chars.arps/klm,\n 4) the word-level kenlm language model file and its binary file people2014corpus_words.arps/klm.\n\nNetdisk link: https://pan.baidu.com/s/1971a5XLQsIpL0zL0zxuK2A password: uc11. Please respect copyright and cite the source when redistributing.\n\n## Chinese error correction datasets\n1. The official NLPCC 2018 GEC dataset [NLPCC2018-GEC](http://tcci.ccf.org.cn/conference/2018/taskdata.php),\ntraining set [trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) [114.5MB]. The data is raw text with no word segmentation.\n2. 
The raw parallel corpus from the Chinese Proficiency Test (HSK) and lang8 [HSK+Lang8](https://pan.baidu.com/s/18JXm1KGmRu3Pe45jt2sYBQ) [190MB]. This dataset is already word-segmented and can be used for data augmentation.\n3. The corpora above, plus the CGED16, CGED17, and CGED18 data, were preprocessed (split into characters, converted from Traditional to Simplified Chinese, and shuffled) to produce a processed corpus for error correction (nlpcc2018+hsk). Netdisk link: https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA password: m6fg [1.3 million sentence pairs, 215MB]\n\n## Contributions and improvements\n\n- [x] Use an RNN language model to improve correction accuracy\n- [x] Improve the visually similar character dictionary to raise accuracy on visually similar character errors\n- [x] Compile Chinese error correction training data and build a deep Chinese correction model with seq2seq\n- [x] Add Chinese grammatical error detection and correction\n- [x] Add a user-defined correction set to the rule-based method and give it the highest correction priority\n- [x] Add dropout to seq2seq_attention to reduce overfitting\n- [x] Add Pointer-generator network, Beam search, Unknown words replacement, Coverage mechanism, and other features to the seq2seq framework\n\n\n## References\n\n1. [\u57fa\u4e8e\u6587\u6cd5\u6a21\u578b\u7684\u4e2d\u6587\u7ea0\u9519\u7cfb\u7edf](https://blog.csdn.net/mingzai624/article/details/82390382)\n2. [Norvig\u2019s spelling corrector](http://norvig.com/spell-correct.html)\n3. 
[\u300aChinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape\u300b[Yu, 2013]](http://www.aclweb.org/anthology/W/W14/W14-6835.pdf)\n4. [\u300aChinese Spelling Checker Based on Statistical Machine Translation\u300b[Chiu, 2013]](http://www.aclweb.org/anthology/O/O13/O13-1005.pdf)\n5. [\u300aChinese Word Spelling Correction Based on Rule Induction\u300b[yeh, 2014]](http://aclweb.org/anthology/W14-6822)\n6. [\u300aNeural Language Correction with Character-Based Attention\u300b[Ziang Xie, 2016]](https://arxiv.org/pdf/1603.09727.pdf)\n7. [\u300aChinese Spelling Check System Based on Tri-gram Model\u300b[Qiang Huang, 2014]](http://www.anthology.aclweb.org/W/W14/W14-6827.pdf)\n8. [\u300aNeural Abstractive Text Summarization with Sequence-to-Sequence Models\u300b[Tian Shi, 2018]](https://arxiv.org/abs/1812.02303)\n9. [\u300a\u57fa\u4e8e\u6df1\u5ea6\u5b66\u4e60\u7684\u4e2d\u6587\u6587\u672c\u81ea\u52a8\u6821\u5bf9\u7814\u7a76\u4e0e\u5b9e\u73b0\u300b[\u6768\u5b97\u9716, 2019]](https://github.com/shibing624/pycorrector/blob/master/docs/\u57fa\u4e8e\u6df1\u5ea6\u5b66\u4e60\u7684\u4e2d\u6587\u6587\u672c\u81ea\u52a8\u6821\u5bf9\u7814\u7a76\u4e0e\u5b9e\u73b0.pdf)\n\n----\n\n# pycorrector\nChinese text error correction tool. 
\n\n\n**pycorrector** uses a language model to detect errors, and pinyin and character-shape features to correct Chinese text \nerrors. It can be used with Chinese pinyin and stroke input methods.\n\n## Features\n### language model\n* Kenlm\n* RNNLM\n\n## Usage\n\n### install\n* pip install pycorrector / pip3 install pycorrector \n* Or download https://github.com/shibing624/pycorrector, unzip it, and run: python setup.py install\n\n### correct \ninput:\n```\nimport pycorrector\n\ncorrected_sent, detail = pycorrector.correct('\u5c11\u5148\u961f\u5458\u56e0\u8be5\u4e3a\u8001\u4eba\u8ba9\u5750')\nprint(corrected_sent, detail)\n\n```\n\noutput:\n```\n\u5c11\u5148\u961f\u5458\u5e94\u8be5\u4e3a\u8001\u4eba\u8ba9\u5ea7 [[('\u56e0\u8be5', '\u5e94\u8be5', 4, 6)], [('\u5750', '\u5ea7', 10, 11)]]\n```\n\n\n### Future work\n1. P(c), the language model. We could create a better language model by collecting more data, and perhaps by using a \nlittle English morphology (such as adding \"ility\" or \"able\" to the end of a word).\n\n2. P(w|c), the error model. So far, the error model has been trivial: the smaller the edit distance, the smaller the \nerror.\nClearly we could use a better model of the cost of edits. We could get a corpus of spelling errors and count how likely it is\nto make each insertion, deletion, or alteration, given the surrounding characters. \n\n3. It turns out that in many cases it is difficult to make a decision based only on a single word. This is most \nobvious when there is a word that appears in the dictionary, but the test set says it should be corrected to another \nword anyway:\ncorrection('where') => 'where' (123); expected 'were' (452)\nWe can't possibly know that correction('where') should be 'were' in at least one case, but should remain 'where' in \nother cases. But if the query had been correction('They where going'), then it seems likely that \"where\" should be \ncorrected to \"were\".\n\n4. 
Finally, we could improve the implementation by making it much faster, without changing the results. We could \nre-implement in a compiled language rather than an interpreted one. We could cache the results of computations so \nthat we don't have to repeat them multiple times. \nOne word of advice: before attempting any speed optimizations, profile carefully to see where the time is actually \ngoing.\n\n\n### Further Reading\n* [Roger Mitton has a survey article on spell checking.](http://www.dcs.bbk.ac.uk/~roger/spellchecking.html)\n\n# Reference\n1. [Norvig\u2019s spelling corrector](http://norvig.com/spell-correct.html)\n2. [Norvig\u2019s spelling corrector(java version)](http://raelcunha.com/spell-correct/)", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/shibing624/pycorrector", "keywords": "NLP,correction,Chinese error corrector,corrector", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "pycorrector", "package_url": "https://pypi.org/project/pycorrector/", "platform": "", "project_url": "https://pypi.org/project/pycorrector/", "project_urls": { "Homepage": "https://github.com/shibing624/pycorrector" }, "release_url": "https://pypi.org/project/pycorrector/0.1.9/", "requires_dist": null, "requires_python": "", "summary": "Chinese Text Error Corrector", "version": "0.1.9" }, "last_serial": 5628240, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "30a6d22c474dd7db3c8966f8888f86d8", "sha256": "0953a7225720d1dc2c5d4830f21245b606ffde5702ea87ce11ef9bd0c72d2ffd" }, "downloads": -1, "filename": "pycorrector-0.0.1.tar.gz", "has_sig": false, "md5_digest": "30a6d22c474dd7db3c8966f8888f86d8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2969, "upload_time": "2018-03-06T05:40:04", "url": 
"https://files.pythonhosted.org/packages/c8/60/3eb10dd5f380b28cb2d0b17531567c6dfc74fd06306a5d71d5ac981e49fa/pycorrector-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "832024085b067487ee2340d7feb99fae", "sha256": "5d267cee7d1f417e43226a705c8c6a48cb7eea528d5f1dd34635a30a7f1714a0" }, "downloads": -1, "filename": "pycorrector-0.0.2.tar.gz", "has_sig": false, "md5_digest": "832024085b067487ee2340d7feb99fae", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2794, "upload_time": "2018-03-06T06:33:02", "url": "https://files.pythonhosted.org/packages/e4/b8/bf4496e132d4bce5213360a119040fa0e47e498ccbf04261e46fb7e7d0a5/pycorrector-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "bfe1828f5403fef3b4da3d408a206f87", "sha256": "b0e8f08f6bebe4a74cb30f6bfc105f55144bf6f637dd4dd5858f3496db2dc637" }, "downloads": -1, "filename": "pycorrector-0.0.3.tar.gz", "has_sig": false, "md5_digest": "bfe1828f5403fef3b4da3d408a206f87", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1980562, "upload_time": "2018-03-06T06:57:14", "url": "https://files.pythonhosted.org/packages/44/f8/99ec0cac03af189ccfb11067e5808f302541fc25c46d684e85e997fd34a9/pycorrector-0.0.3.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "5a7141ed5eb9f7b88fc99b4050ccf5c5", "sha256": "c0ee553cd589c4e635838e73bce855a0dde6967ada3ee36f57abf2ddd8c32e71" }, "downloads": -1, "filename": "pycorrector-0.1.0.tar.gz", "has_sig": false, "md5_digest": "5a7141ed5eb9f7b88fc99b4050ccf5c5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12076, "upload_time": "2018-03-14T06:56:53", "url": "https://files.pythonhosted.org/packages/b0/ea/28140921892124bdc955b2d68c53ff7978c222332985d1068b74e4c0ebc4/pycorrector-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "a4cb6987ce49e8087589d44842b5b875", "sha256": 
"e29c14651044ba9b3ddb6510b267c6824030c58aa8806ac2ebc15d43105c2d2d" }, "downloads": -1, "filename": "pycorrector-0.1.1.tar.gz", "has_sig": false, "md5_digest": "a4cb6987ce49e8087589d44842b5b875", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21284455, "upload_time": "2018-03-14T07:03:02", "url": "https://files.pythonhosted.org/packages/1a/06/a0e2f568bfcf0796b2ed49843f21553bf4ff73096c5f307bf695a89d6d15/pycorrector-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "69d91544576e4eceaf46b60c976a3eb2", "sha256": "21e5869bd2818ae193aff233487cfa0feab213d496ca6e406aaf829ddac0ff1b" }, "downloads": -1, "filename": "pycorrector-0.1.2.tar.gz", "has_sig": false, "md5_digest": "69d91544576e4eceaf46b60c976a3eb2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21282029, "upload_time": "2018-03-14T07:13:21", "url": "https://files.pythonhosted.org/packages/f1/b6/4813fd7e1e3b48d45d45658c237848cd6f08af5d13ac14bbf5afc4a957fd/pycorrector-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "8e47a6e7777155e5ffeecf6568445b3d", "sha256": "e13cc9e58d76b044bd795d345ab643b19bc8da5e9c711e45abdb1111f9d5e087" }, "downloads": -1, "filename": "pycorrector-0.1.3.tar.gz", "has_sig": false, "md5_digest": "8e47a6e7777155e5ffeecf6568445b3d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20428652, "upload_time": "2018-08-29T12:23:18", "url": "https://files.pythonhosted.org/packages/33/c6/a5754eee5df764d020bec144d69a7d4d778ca43ce87a4955029c1ef7d6d2/pycorrector-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "26e0726325cf7f933de9edb6a6d3d9c6", "sha256": "ef6dd1b681f3f94d9ca936bc014abf4cd4d034d5c8dbd04a1bc578ebf866d2bc" }, "downloads": -1, "filename": "pycorrector-0.1.4.tar.gz", "has_sig": false, "md5_digest": "26e0726325cf7f933de9edb6a6d3d9c6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 
16390001, "upload_time": "2018-09-17T14:03:42", "url": "https://files.pythonhosted.org/packages/19/f7/9f923dde26714be6bf164eb915f91cda2b670704d410e65ff910e5098ae7/pycorrector-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "bcdcda29d6eeb950eb3015b4c57f6aa7", "sha256": "ce7db4aa8e411cbf6144cfdeedb65272407b975bf2106bd492115766e231bdcd" }, "downloads": -1, "filename": "pycorrector-0.1.5.tar.gz", "has_sig": false, "md5_digest": "bcdcda29d6eeb950eb3015b4c57f6aa7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16310614, "upload_time": "2018-09-19T07:25:03", "url": "https://files.pythonhosted.org/packages/c4/f9/7163b8d3c2f1ecc476d736d154f6cf169ef176b782a7299266c87d455371/pycorrector-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "f588b3555479dd4d53c6b2505166def2", "sha256": "c747f63ebf2ce9303fadcbd4d54e8fd579a0ea45d631252448ad61b29a2164ff" }, "downloads": -1, "filename": "pycorrector-0.1.6.tar.gz", "has_sig": false, "md5_digest": "f588b3555479dd4d53c6b2505166def2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17856564, "upload_time": "2019-06-21T06:35:58", "url": "https://files.pythonhosted.org/packages/0e/cc/595c3a70191b8b6a15a263685f2ba946d3ffbeea1a31962139223c5178db/pycorrector-0.1.6.tar.gz" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "ec6f46fb168241bd461a278465f2d4f0", "sha256": "cf8576c14edb39f76e63ca9ca89884162013c75bae4651a6804d28d1e02da2ba" }, "downloads": -1, "filename": "pycorrector-0.1.7.tar.gz", "has_sig": false, "md5_digest": "ec6f46fb168241bd461a278465f2d4f0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17862074, "upload_time": "2019-06-27T03:28:20", "url": "https://files.pythonhosted.org/packages/3e/8f/7fdc7a932162a6dca38a0d426ce0a37c7ba06584d9f59040c99b2e84c7cc/pycorrector-0.1.7.tar.gz" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": 
"50675ac572fd4e5ef61eb8e482b425c0", "sha256": "c970f5fc2b50587f605b536cb80d190b7864316149a728491678428cddeaa988" }, "downloads": -1, "filename": "pycorrector-0.1.8.tar.gz", "has_sig": false, "md5_digest": "50675ac572fd4e5ef61eb8e482b425c0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17862024, "upload_time": "2019-06-27T03:47:49", "url": "https://files.pythonhosted.org/packages/0a/34/6f21eb6afad56a311cc0773edb506b50a8b7d3352657e3427cf8d67f9308/pycorrector-0.1.8.tar.gz" } ], "0.1.9": [ { "comment_text": "", "digests": { "md5": "2cb95e1eee75da6f99a9ecdce34e8a86", "sha256": "809fa8796717a3b4d7151d1689efa7471e97cae9e925c83cdbbe9d051ab280b8" }, "downloads": -1, "filename": "pycorrector-0.1.9.tar.gz", "has_sig": false, "md5_digest": "2cb95e1eee75da6f99a9ecdce34e8a86", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17872393, "upload_time": "2019-08-03T13:58:30", "url": "https://files.pythonhosted.org/packages/59/68/80d628ffd92f9483380ca72ea206e867451f3db30fdeba6937ed7c580ca3/pycorrector-0.1.9.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2cb95e1eee75da6f99a9ecdce34e8a86", "sha256": "809fa8796717a3b4d7151d1689efa7471e97cae9e925c83cdbbe9d051ab280b8" }, "downloads": -1, "filename": "pycorrector-0.1.9.tar.gz", "has_sig": false, "md5_digest": "2cb95e1eee75da6f99a9ecdce34e8a86", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17872393, "upload_time": "2019-08-03T13:58:30", "url": "https://files.pythonhosted.org/packages/59/68/80d628ffd92f9483380ca72ea206e867451f3db30fdeba6937ed7c580ca3/pycorrector-0.1.9.tar.gz" } ] }