{ "info": { "author": "Ailln", "author_email": "kinggreenhall@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Simple Jieba\n\n\u2702\ufe0f A simple version of [jieba](https://github.com/fxsjy/jieba) word segmentation, implemented in 100 lines.\n\n> It really is just 100 lines; count them yourself in [./simjb/model.py](./simjb/model.py) if you don't believe it.\n\n## Performance comparison\n\nBecause this simple version implements only the core functionality of jieba segmentation, the expected outcome is: **segmentation accuracy goes down and segmentation speed goes up.**\n\nI compared performance on the `Peking University` and `Microsoft Research` training sets from the [bakeoff2005](http://sighan.cs.uchicago.edu/bakeoff2005/) dataset, with the following results:\n\n| Peking University (pku) | Accuracy (correct words / total words) | Speed in words/s (total words / elapsed time) |\n| :---------------: | :-------------------------: | :-----------------------: |\n| jieba | **78.90%** (890761/1129003) | 119k (1129003/9.45s) |\n| simjb | 78.06% (881348/1129003) | **159k** (1129003/7.10s) |\n\n\n| Microsoft Research (msr) | Accuracy (correct words / total words) | Speed in words/s (total words / elapsed time) |\n| :----------------: | :--------------------------: | :-----------------------: |\n| jieba | 76.87% (1822646/2370974) | 133k (2370974/17.77s) |\n| simjb | **81.37%** (1929210/2370974) | **174k** (2370974/13.63s) |\n\nPeking University 
results match expectations: accuracy drops, but by less than 1 percentage point. On the Microsoft Research set, accuracy rises instead, somewhat surprisingly, by about 4.5 percentage points. In addition, segmentation speed improves by roughly **30%** on both sets!\n\nI originally extracted this core code from the jieba source only so that anyone who wants to study it later has a simpler, easier-to-follow reference. Judging from the results above, this simple version appears to be usable. (Feel free to run more tests and prove me wrong, haha.)\n\nSee [here](./test/README.md) for the test method.\n\n## Guide\n\n![](./simjb/src/simple-jieba_flow.png)\n\n### 1 Split into blocks at punctuation\n\n```python\nimport re\n\nclass Tokenizer(object):\n    def __init__(self):\n        self.re_cn = re.compile(\"([\\u4E00-\\u9FD5a-zA-Z0-9+#&._%-]+)\", re.U)\n\n    def cut(self, sentence):\n        block_list = self.re_cn.split(sentence)\n        cut_result_list = []\n        for block in block_list:\n            # Skip empty blocks\n            if not block:\n                continue\n            if self.re_cn.match(block):\n                cut_result_list.extend(self.cut_util(block))\n            else:\n                cut_result_list.append(block)\n        return cut_result_list\n```\n\nFirst, the input sentence is split with a regular expression; in effect, it is cut before and after each punctuation mark.\n\n```\n\u5feb\u770b\uff0c\u662f\u6b66\u6c49\u5e02\u957f\u6c5f\u5927\u6865\uff01\n[\"\u5feb\u770b\", \"\uff0c\", 
\"\u662f\u6b66\u6c49\u5e02\u957f\u6c5f\u5927\u6865\", \"\uff01\"]\n```\n\n### 2 Build a directed acyclic graph from the dictionary\n\n```python\nfrom pkg_resources import resource_stream  # assumed source of resource_stream\n\ndef _get_freq_dict(self):\n    stream = resource_stream(*self.dict_path)\n    freq_dict = {}\n    freq_total = 0\n    for line in stream.readlines():\n        word, freq = line.decode(\"utf-8\").split(\" \")[:2]\n        freq = int(freq)\n        freq_dict[word] = freq\n        freq_total += freq\n        # Register every prefix of the word; unseen prefixes get frequency 0\n        for word_index in range(len(word)):\n            word_frag = word[:word_index + 1]\n            if word_frag not in freq_dict:\n                freq_dict[word_frag] = 0\n    return freq_dict, freq_total\n```\n\nFirst we need a dictionary that carries word frequencies, for example:\n\n```\nAT&T 3 nz\nB\u8d85 3 n\nc# 3 nz\nC# 3 nz\nc++ 3 nz\n...\n```\n\nNext, the dictionary is preprocessed into a new dict.\n\n```\n{\n    \"AT&T\": 3,\n    \"A\": 0,\n    \"AT\": 0,\n    \"AT&\": 0,\n    \"B\u8d85\": 3,\n    \"B\": 0,\n    ...\n}\n```\n\nYou may have noticed that, besides adding each word's frequency to the dict, the code also adds every prefix of each word.\n\nThe frequency rule for these prefixes is: if the prefix is itself in the original dictionary, it gets that word's frequency; otherwise it is set to 0.\n\nA prefix with frequency 0 is not a word on its own, but its presence tells `_get_dag` that a longer dictionary word may still follow, so the scan keeps extending the fragment instead of stopping.\n\n```python\ndef _get_dag(self, sentence):\n    dag = {}\n    sen_len = len(sentence)\n    for i in range(sen_len):\n        temp_list = []\n        j = i\n        frag = sentence[i]\n        while j < sen_len and frag in self.freq_dict:\n            # Only fragments with a non-zero frequency are real words\n            if self.freq_dict[frag]:\n                temp_list.append(j)\n            j += 1\n            frag = sentence[i:j + 1]\n        if not temp_list:\n            temp_list.append(i)\n        dag[i] = temp_list\n    return 
dag\n```\n\nStarting from each position, the code tries fragments of increasing length; whenever a fragment is a word in the frequency dictionary (non-zero frequency), its end index is recorded. The result is a directed acyclic graph (DAG).\n\n```\n# \u5feb\u770b\n{0: [0], 1: [1]}\n# \u662f\u6b66\u6c49\u5e02\u957f\u6c5f\u5927\u6865\n{0: [0], 1: [1, 2, 3], 2: [2], 3: [3, 4], 4: [4, 5, 7], 5: [5], 6: [6, 7], 7: [7]}\n```\n\n### 3 Find the maximum-frequency path with dynamic programming\n\n```python\nfrom math import log\n\ndef _calc_dag_with_dp(self, sentence):\n    dag = self._get_dag(sentence)\n    sen_len = len(sentence)\n    route = {sen_len: (0, 0)}\n    # Work in log space to avoid numeric underflow\n    log_total = log(self.freq_total)\n    for sen_index in reversed(range(sen_len)):\n        freq_list = []\n        for word_index in dag[sen_index]:\n            word_freq = self.freq_dict.get(sentence[sen_index:word_index + 1])\n            # log(0) is undefined, so fall back to log(1) = 0\n            freq_index = (log(word_freq or 1) - log_total + route[word_index + 1][0], word_index)\n            freq_list.append(freq_index)\n        route[sen_index] = max(freq_list)\n    return route\n```\n\nDynamic programming works backwards from the end of the sentence to derive the best path: `route[i]` holds the best log-probability for segmenting `sentence[i:]`, together with the end index of the first word on that path.\n\n### 4 
Merge the segmentation results of all blocks\n\n```python\ndef cut_util(self, sentence):\n    word_index = 0\n    word_buf = \"\"\n    result = []\n    route = self._calc_dag_with_dp(sentence)\n    while word_index < len(sentence):\n        word_index_end = route[word_index][1] + 1\n        word = sentence[word_index:word_index_end]\n        # Buffer single English characters so they can be joined into one word\n        if self.re_eng.match(word) and len(word) == 1:\n            word_buf += word\n            word_index = word_index_end\n        else:\n            if word_buf:\n                # Flush the buffer first; the current word is handled on the next pass\n                result.append(word_buf)\n                word_buf = \"\"\n            else:\n                result.append(word)\n                word_index = word_index_end\n    # Flush any English left in the buffer (e.g. an all-English block)\n    if word_buf:\n        result.append(word_buf)\n    return result\n```\n\nA run of consecutive English letters is treated as one whole word and is not split apart.\n\nFinally, all the results are collected and segmentation is complete!\n\n## License\n\n[![](https://award.dovolopor.com?lt=License&rt=MIT&rbc=green)](./LICENSE)\n\n## References\n\n- [Chinese word segmentation resources](https://github.com/HaveTwoBrush/nlp-roadmap#1-%E5%88%86%E8%AF%8D-word-segmentation)\n- [How to create a repository from a template](https://help.github.com/cn/articles/creating-a-repository-from-a-template)\n- [How to publish your own package to PyPI](https://www.v2ai.cn/python/2018/07/30/PY-1.html)", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/HaveTwoBrush/simple-jieba", "keywords": "", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "simjb", "package_url": "https://pypi.org/project/simjb/", "platform": "", "project_url": "https://pypi.org/project/simjb/", "project_urls": { "Homepage": "https://github.com/HaveTwoBrush/simple-jieba" }, "release_url": 
"https://pypi.org/project/simjb/0.1.0/", "requires_dist": null, "requires_python": "", "summary": "A simple version of jieba.", "version": "0.1.0" }, "last_serial": 5980091, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "3ab7414c3f7afde3b3c09dd1a45d82f5", "sha256": "975b8d28e6580bac90f74abc32cc3ba85b401b48d2495d6497ec9a98dfd97ddb" }, "downloads": -1, "filename": "simjb-0.1.0.tar.gz", "has_sig": false, "md5_digest": "3ab7414c3f7afde3b3c09dd1a45d82f5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7325, "upload_time": "2019-10-15T21:57:16", "url": "https://files.pythonhosted.org/packages/94/0f/a4b0ed3db98ae73aebfcaa0abd792280955d2e85fa014718a488e4316972/simjb-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3ab7414c3f7afde3b3c09dd1a45d82f5", "sha256": "975b8d28e6580bac90f74abc32cc3ba85b401b48d2495d6497ec9a98dfd97ddb" }, "downloads": -1, "filename": "simjb-0.1.0.tar.gz", "has_sig": false, "md5_digest": "3ab7414c3f7afde3b3c09dd1a45d82f5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7325, "upload_time": "2019-10-15T21:57:16", "url": "https://files.pythonhosted.org/packages/94/0f/a4b0ed3db98ae73aebfcaa0abd792280955d2e85fa014718a488e4316972/simjb-0.1.0.tar.gz" } ] }