{ "info": { "author": "Mithril | eromoe", "author_email": "eromoe@users.noreply.github.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Web Environment", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "# SimCRF\n\nThis project aims to provide a very easy way to train a CRF model and extract entities from text.\n\n[Chinese documentation (\u4e2d\u6587\u6587\u6863)](https://github.com/eromoe/SimCRF/blob/master/README.CN.md)\n\n## Installation\n\n pip install simcrf\n\n## Training Data Format\n\nCRF models usually use IOB tagging (https://en.wikipedia.org/wiki/Inside_Outside_Beginning).\n\nInput data can be either:\n\n1. word, pos_tag, iob_tag\n2. 
word, iob_tag\n\niob_tag:\n\n- I: inside of entity\n- O: outside of entity\n- B: beginning of entity\n\nExample:\n\n \u6253\u5370\u673a n O\n \u91c7\u8d2d v O\n \u54c1\u76ee n O\n \u91c7\u8d2d v O\n \u5355\u4f4d n O\n \u66f2\u5468\u53bf nr B\n \u804c\u4e1a n I\n \u6280\u672f n I\n \u6559\u80b2 vn I\n \u4e2d\u5fc3 n I\n \u884c\u653f\u533a\u57df n O\n \u66f2\u5468\u53bf nr O\n \u516c\u544a n O\n \u65f6\u95f4 n O\n\n \u6280\u672f n I\n \u6559\u80b2 vn I\n \u4e2d\u5fc3 n I\n \u91c7\u8d2d v O\n \u5355\u4f4d\u5730\u5740 n O\n \u66f2\u5468\u53bf nr B\n \u804c\u4e1a n I\n \u6280\u672f n I\n \u6559\u80b2 vn I\n \u4e2d\u5fc3 n I\n \u91c7\u8d2d v O\n \u5355\u4f4d n O\n \u8054\u7cfb\u65b9\u5f0f l O\n 18932708288 m O\n\n \u4e2d\u5fc3 n I\n \u91c7\u8d2d v O\n \u4eba n O\n \u5730\u5740 n O\n \uff1a x O\n \u66f2\u5468\u53bf nr B\n \u804c\u4e1a n I\n \u6280\u672f n I\n \u6559\u80b2 vn I\n \u4e2d\u5fc3 n I\n \u91c7\u8d2d v O\n \u4eba n O\n \u8054\u7cfb\u65b9\u5f0f l O\n \uff1a x O\n\n\n## Usage\n\n#### Train model:\n\n from simcrf import SimCRF\n\n ner = SimCRF()\n\n # note: also support only tokens\n X_train = [\n [\n ('\u6253\u5370\u673a', 'n'), ('\u91c7\u8d2d', 'v'), ('\u54c1\u76ee', 'n'), ('\u91c7\u8d2d', 'v'), ('\u5355\u4f4d', 'n'), ('\u66f2\u5468\u53bf', 'nr'), ('\u804c\u4e1a', 'n'), ('\u6280\u672f', 'n'), ('\u6559\u80b2', 'vn'), ('\u4e2d\u5fc3', 'n'), ('\u884c\u653f\u533a\u57df', 'n'), ('\u66f2\u5468\u53bf', 'nr'), ('\u516c\u544a', 'n'), ('\u65f6\u95f4', 'n')\n ],\n [\n ('\u6253\u5370\u673a', 'n'), ('\u91c7\u8d2d', 'v'), ('\u54c1\u76ee', 'n'), ('\u91c7\u8d2d', 'v'), ('\u5355\u4f4d', 'n'), ('\u66f2\u5468\u53bf', 'nr'), ('\u804c\u4e1a', 'n'), ('\u6280\u672f', 'n'), ('\u6559\u80b2', 'vn'), ('\u4e2d\u5fc3', 'n'), ('\u884c\u653f\u533a\u57df', 'n'), ('\u66f2\u5468\u53bf', 'nr'), ('\u516c\u544a', 'n'), ('\u65f6\u95f4', 'n')\n ]\n ]\n\n y_train = [\n ['O','O','O','O','O','B','I','I','I','I','O','O','O','O'],\n ['O','O','O','O','O','B','I','I','I','I','O','O','O','O']\n ]\n\n X_train 
= ner.transform(X_train)\n ner.fit(X_train, y_train)\n\n#### Save model\n\n ner.save('~/crf_test.pkl')\n\n#### Load model\n\n ner = SimCRF.load('~/crf_test.pkl')\n\n#### Extract entities\n\nTo support different tokenizers, you need to tokenize your text first and then feed the tokens to the CRF model.\n\n import jieba.posseg as pseg\n ner = SimCRF.load('xxxx.pkl')\n\n text = ''' \u3000\u54c8\u5c14\u6ee8\u5de5\u4e1a\u5927\u5b66\u62db\u6807\u4e0e\u91c7\u8d2d\u7ba1\u7406\u4e2d\u5fc3\u53d7\u603b\u52a1\u5904\u7684\u59d4\u6258\uff0c\u5c31\u54c8\u5c14\u6ee8\u5de5\u4e1a\u5927\u5b66\u90e8\u5206\u4f4f\u5b85\u5c0f\u533a\u4f9b\u70ed\u5165\u7f51\u9879\u76ee\uff08\u9879\u76ee\u7f16\u53f7\uff1aGC2017DX035\uff09\u7ec4\u7ec7\u91c7\u8d2d\uff0c\u8bc4\u6807\u5de5\u4f5c\u5df2\u7ecf\u7ed3\u675f\uff0c\u4e2d\u6807\u7ed3\u679c\u5982\u4e0b\uff1a\n\n \u4e00\u3001\u9879\u76ee\u4fe1\u606f\n\n \u9879\u76ee\u7f16\u53f7\uff1aGC2017DX035\n\n \u9879\u76ee\u540d\u79f0\uff1a\u54c8\u5c14\u6ee8\u5de5\u4e1a\u5927\u5b66\u90e8\u5206\u4f4f\u5b85\u5c0f\u533a\u4f9b\u70ed\u5165\u7f51\n\n \u9879\u76ee\u8054\u7cfb\u4eba\uff1a\u674e\u5360\u594e \u738b \u5409\n\n \u8054\u7cfb\u65b9\u5f0f\uff1a\u7535\u8bdd\uff1a 0451-86417953 13936645563\n\n\n\n \u4e8c\u3001\u91c7\u8d2d\u5355\u4f4d\u4fe1\u606f\n\n \u91c7\u8d2d\u5355\u4f4d\u540d\u79f0\uff1a\u603b\u52a1\u5904\n\n \u91c7\u8d2d\u5355\u4f4d\u5730\u5740\uff1a\u54c8\u5c14\u6ee8\u5e02\u5357\u5c97\u533a\u897f\u5927\u76f4\u885792\u53f7\n\n \u91c7\u8d2d\u5355\u4f4d\u8054\u7cfb\u65b9\u5f0f\uff1a\u5b54\u7e41\u6b66 0451-86417975\n\n\n\n \u4e09\u3001\u9879\u76ee\u7528\u9014\u3001\u7b80\u8981\u6280\u672f\u8981\u6c42\u53ca\u5408\u540c\u5c65\u884c\u65e5\u671f\uff1a\n\n \u89c1\u7ed3\u679c\u516c\u793a\n\n\n\n \u56db\u3001\u91c7\u8d2d\u4ee3\u7406\u673a\u6784\u4fe1\u606f\n\n \u91c7\u8d2d\u4ee3\u7406\u673a\u6784\u5168\u79f0\uff1a\u54c8\u5c14\u6ee8\u5de5\u4e1a\u5927\u5b66\u62db\u6807\u4e0e\u91c7\u8d2d\u7ba1\u7406\u4e2d\u5fc3\n\n 
\u91c7\u8d2d\u4ee3\u7406\u673a\u6784\u5730\u5740\uff1a\u54c8\u5c14\u6ee8\u5e02\u5357\u5c97\u533a\u897f\u5927\u76f4\u885792\u53f7\u54c8\u5c14\u6ee8\u5de5\u4e1a\u5927\u5b66\u884c\u653f\u529e\u516c\u697c203\u623f\u95f4\n\n \u91c7\u8d2d\u4ee3\u7406\u673a\u6784\u8054\u7cfb\u65b9\u5f0f\uff1a\u674e\u5360\u594e \u738b \u5409 \u7535\u8bdd\uff1a 0451-86417953 13936645563\n '''\n\n sent = [tuple(pair) for pair in pseg.cut(text)]\n ret = ner.extract_taggedtokens(sent)\n\n print(ret)\n\n#### Custom crfsuite model\n\nSimCRF aims to provide a simple and easy way to train models and extract entities.\nIt takes feature transformation and training off your hands. So to customize the CRF model, you need to train a sklearn-crfsuite model yourself. You would change the training parameters and the feature generation yourself, and then pass the model to SimCRF:\n\n from simcrf import SimCRF\n import sklearn_crfsuite\n\n crf_model = sklearn_crfsuite.CRF(\n algorithm='lbfgs',\n c1=0.1,\n c2=0.1,\n max_iterations=100,\n all_possible_transitions=True\n )\n crf_model.fit(X_train, y_train)\n\n ner = SimCRF(crf_model)\n\n ret = ner.extract(sent)\n\nsklearn-crfsuite docs: https://sklearn-crfsuite.readthedocs.io/\n\ncrfsuite docs: http://www.chokkan.org/software/crfsuite/manual.html\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/eromoe/SimCRF", "keywords": "c,r,f, ,c,r,f,s,u,i,t,e", "license": "", "maintainer": "", "maintainer_email": "", "name": "simcrf", "package_url": "https://pypi.org/project/simcrf/", "platform": "", "project_url": "https://pypi.org/project/simcrf/", "project_urls": { "Homepage": "https://github.com/eromoe/SimCRF" }, "release_url": "https://pypi.org/project/simcrf/0.1.8/", "requires_dist": [ "sklearn-crfsuite", "numpy", "jieba", "six", "pytest" ], "requires_python": "", "summary": "simple and quick crf wrapper for crfsuite", "version": "0.1.8" }, "last_serial": 3875060, 
"releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "d0f73bc05a7f01e741b6d46fa009007e", "sha256": "c01c756be93ef50472b386f48d1ee703399d04a76c5dafd7e352b95d42fd36fd" }, "downloads": -1, "filename": "simcrf-0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d0f73bc05a7f01e741b6d46fa009007e", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10286, "upload_time": "2017-07-27T08:21:12", "url": "https://files.pythonhosted.org/packages/a9/52/ff2e211303153b81272dec5ec8075371cefe8e1b99e7e5d1ca97790cef00/simcrf-0.1-py2.py3-none-any.whl" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "0b1cee745efc3c3daaa5bcb05f69b67d", "sha256": "26c050ff2b6ceba4e92a76c91f8d5a0cc843c78f149bab17696fcce1e964c3fe" }, "downloads": -1, "filename": "simcrf-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "0b1cee745efc3c3daaa5bcb05f69b67d", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10664, "upload_time": "2017-07-31T01:08:19", "url": "https://files.pythonhosted.org/packages/79/20/eeb47e494e235c2fe28f5f7079b773419887d06236ab11cddab869cfdc45/simcrf-0.1.1-py2.py3-none-any.whl" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "de655328a273306adaee8c38377d3e24", "sha256": "743deb280a7aa1cc3a3e47d42781da6091139f5b9f9a1814aff0a57e05861981" }, "downloads": -1, "filename": "simcrf-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "de655328a273306adaee8c38377d3e24", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 9771, "upload_time": "2017-08-29T08:28:09", "url": "https://files.pythonhosted.org/packages/66/e1/81bf03f364aa5eb724dd05efd1307b526f75451992c2d7e7615e1972c70f/simcrf-0.1.2-py2.py3-none-any.whl" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "89429a12c0a5bd5a2148be1cdf626d1b", "sha256": "e14999b75761bffe638523d74f772367dc6dedd7c37220680221155b09c28026" }, "downloads": -1, "filename": 
"simcrf-0.1.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "89429a12c0a5bd5a2148be1cdf626d1b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6424, "upload_time": "2018-04-19T05:58:15", "url": "https://files.pythonhosted.org/packages/65/cf/be613ddfc1ceb9f577e72da514790cc948a9a7189f186715a44a1c1f506c/simcrf-0.1.4-py2.py3-none-any.whl" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "37b1f1336c40e2412992e214e894aaf4", "sha256": "f1807bce5dfbc45c45393dbb7595c55138d576d82d708eb0ac17140e6ff76de5" }, "downloads": -1, "filename": "simcrf-0.1.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "37b1f1336c40e2412992e214e894aaf4", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6533, "upload_time": "2018-04-20T03:19:04", "url": "https://files.pythonhosted.org/packages/6f/02/437127321c54646a4d59009de830870751fef47026a3f14dcbbe12d73ca3/simcrf-0.1.5-py2.py3-none-any.whl" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "e3e98ca3ffcdc2d660cb896b74ae94af", "sha256": "e4219650d0c5df9353018a5f61412862d1387265a36227927ef4862bf1aae908" }, "downloads": -1, "filename": "simcrf-0.1.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "e3e98ca3ffcdc2d660cb896b74ae94af", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6544, "upload_time": "2018-04-24T01:07:39", "url": "https://files.pythonhosted.org/packages/40/8f/20e26a9b5313e79e22b7e19ce7f73a459085f13015e33c18187720ca0e98/simcrf-0.1.6-py2.py3-none-any.whl" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "e94b00a1069ece1c9eeebb9c4190614f", "sha256": "4bc4b8edcca167a6fda43f42c4317c849410bd4be5491677866354232337c979" }, "downloads": -1, "filename": "simcrf-0.1.7-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "e94b00a1069ece1c9eeebb9c4190614f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6678, "upload_time": 
"2018-05-14T00:40:21", "url": "https://files.pythonhosted.org/packages/16/58/c2c462fe74ad4f587ea217c36b3950e8c2ce3384405d3bdfdf741b753a1f/simcrf-0.1.7-py2.py3-none-any.whl" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "d23b9bfe08a6811d51707fb288ad69dc", "sha256": "89514ac4464b6b9316e0f9f12937c4b223be45ba177b02ff46dcffd9a162cab3" }, "downloads": -1, "filename": "simcrf-0.1.8-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d23b9bfe08a6811d51707fb288ad69dc", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6695, "upload_time": "2018-05-18T07:52:12", "url": "https://files.pythonhosted.org/packages/79/93/9a58f33a8d7450281123126ac7c7d7640229752d1ff21e60a6ca6f169617/simcrf-0.1.8-py2.py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d23b9bfe08a6811d51707fb288ad69dc", "sha256": "89514ac4464b6b9316e0f9f12937c4b223be45ba177b02ff46dcffd9a162cab3" }, "downloads": -1, "filename": "simcrf-0.1.8-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d23b9bfe08a6811d51707fb288ad69dc", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6695, "upload_time": "2018-05-18T07:52:12", "url": "https://files.pythonhosted.org/packages/79/93/9a58f33a8d7450281123126ac7c7d7640229752d1ff21e60a6ca6f169617/simcrf-0.1.8-py2.py3-none-any.whl" } ] }