{ "info": { "author": "Katsuma Narisawa", "author_email": "katsuma.narisawa@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.7" ], "description": "# simstring\n[![PyPI - Status](https://img.shields.io/pypi/status/simstring-pure.svg)](https://pypi.org/project/simstring-pure/)\n[![PyPI version](https://badge.fury.io/py/simstring-pure.svg)](https://badge.fury.io/py/simstring-pure)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/simstring-pure.svg)](https://pypi.org/project/simstring-pure/0.0.1/)\n[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg?style=flat)](LICENSE)\n[![CircleCI](https://circleci.com/gh/nullnull/simstring.svg?style=svg)](https://circleci.com/gh/nullnull/simstring)\n[![Maintainability](https://api.codeclimate.com/v1/badges/66eb2018262f03ece8a3/maintainability)](https://codeclimate.com/github/nullnull/simstring/maintainability)\n\n\nA Python implementation of [SimString](http://www.chokkan.org/software/simstring/index.html.en), a simple and efficient algorithm for approximate string matching.\n\n## Features\nWith this library, you can retrieve the strings or texts that meet a given similarity threshold from a large collection of strings or texts. It is useful when developing applications related to language processing.\n\nThis library supports several similarity functions, such as cosine similarity and Jaccard similarity, and supports word N-grams and character N-grams as features. You can also easily implement your own feature extractor.\n\nSimString has the following features:\n\n* Fast algorithm for approximate string retrieval.\n* 100% exact retrieval. 
Although some algorithms allow misses (false negatives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.\n* Unicode support.\n* Extensibility. You can implement your own feature extractor easily.\n* Japanese support. Morpheme N-grams using [MeCab](http://taku910.github.io/mecab/) are supported.\n\nPlease see [this paper](http://www.aclweb.org/anthology/C10-1096) for more details.\n\n\n## Install\n```\npip install simstring-pure\n```\n\n## Usage\n```python\nfrom simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor\nfrom simstring.measure.cosine import CosineMeasure\nfrom simstring.database.dict import DictDatabase\nfrom simstring.searcher import Searcher\n\ndb = DictDatabase(CharacterNgramFeatureExtractor(2))\ndb.add('foo')\ndb.add('bar')\ndb.add('fooo')\n\nsearcher = Searcher(db, CosineMeasure())\nresults = searcher.search('foo', 0.8)\nprint(results)\n# => ['foo', 'fooo']\n```\n\nTo use a different feature extractor, measure, or database, simply replace these classes. 
You can also substitute your own classes.\n\n```python\nfrom simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor\nfrom simstring.measure.jaccard import JaccardMeasure\nfrom simstring.database.mongo import MongoDatabase\nfrom simstring.searcher import Searcher\n\ndb = MongoDatabase(WordNgramFeatureExtractor(2))\ndb.add('You are so cool.')\n\nsearcher = Searcher(db, JaccardMeasure())\nresults = searcher.search('You are cool.', 0.8)\nprint(results)\n```\n\n## Supported String Similarity Measures\n- Cosine\n- Dice\n- Jaccard\n\n## Run Tests\n```\npython -m unittest discover tests\n```\n\n## Benchmark\n* About 1 ms per search over 5797 strings (company names).\n* About 14 ms per search over 235544 strings (unabridged dictionary).\n\n#### search from `dev/data/company_names.txt`\n```\n$ python dev/benchmark.py\nbenchmark for using dict as database\n## benchmarker: release 4.0.1 (for python)\n## python version: 3.7.0\n## python compiler: GCC 7.2.0\n## python platform: Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4\n## python executable: /opt/conda/envs/simstring/bin/python\n## cpu model: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz # 3300.000 MHz\n## parameters: loop=1, cycle=1, extra=0\n\n## real (total = user + sys)\ninitialize database(5797 lines) 0.1227 0.1200 0.1200 0.0000\nsearch text(5797 times) 6.9719 6.9400 6.8900 0.0500\n\n## Ranking real\ninitialize database(5797 lines) 0.1227 (100.0) ********************\nsearch text(5797 times) 6.9719 ( 1.8)\n\n## Matrix real [01] [02]\n[01] initialize database(5797 lines) 0.1227 100.0 5680.9\n[02] search text(5797 times) 6.9719 1.8 100.0\n\nbenchmark for using Mongo as database\n## benchmarker: release 4.0.1 (for python)\n## python version: 3.7.0\n## python compiler: GCC 7.2.0\n## python platform: Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4\n## python executable: /opt/conda/envs/simstring/bin/python\n## cpu model: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz # 
3300.000 MHz\n## parameters: loop=1, cycle=1, extra=0\n\n## real (total = user + sys)\ninitialize database(5797 lines) 4.5762 2.4900 1.9200 0.5700\nsearch text(5797 times) 177.8401 60.9100 47.2500 13.6600\n\n## Ranking real\ninitialize database(5797 lines) 4.5762 (100.0) ********************\nsearch text(5797 times) 177.8401 ( 2.6) *\n\n## Matrix real [01] [02]\n[01] initialize database(5797 lines) 4.5762 100.0 3886.2\n[02] search text(5797 times) 177.8401 2.6 100.0\n```\n\n#### search from `dev/data/unabridged_dictionary.txt`\n```\n$ python dev/benchmark.py\nbenchmark for using dict as database\n## benchmarker: release 4.0.1 (for python)\n## python version: 3.7.0\n## python compiler: GCC 7.2.0\n## python platform: Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4\n## python executable: /opt/conda/envs/simstring/bin/python\n## cpu model: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz # 3300.000 MHz\n## parameters: loop=1, cycle=1, extra=0\n\n## real (total = user + sys)\ninitialize database(235544 lines) 2.2576 2.2300 2.1200 0.1100\nsearch text(10000 times) 141.0302 140.6400 139.9600 0.6800\n\n## Ranking real\ninitialize database(235544 lines) 2.2576 (100.0) ********************\nsearch text(10000 times) 141.0302 ( 1.6)\n\n## Matrix real [01] [02]\n[01] initialize database(235544 lines) 2.2576 100.0 6246.8\n[02] search text(10000 times) 141.0302 1.6 100.0\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/nullnull/simstring", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "simstring-pure", "package_url": "https://pypi.org/project/simstring-pure/", "platform": "", "project_url": "https://pypi.org/project/simstring-pure/", "project_urls": { "Homepage": "https://github.com/nullnull/simstring" }, "release_url": "https://pypi.org/project/simstring-pure/1.0.0/", "requires_dist": null, 
"requires_python": "", "summary": "A Python implementation of the SimString, a simple and efficient algorithm for approximate string matching.", "version": "1.0.0" }, "last_serial": 4099293, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "7ce680a0c70e226b557827ab395d3ee8", "sha256": "3ebe7ed7121b0fd52ce6490d88114fc41f5c03760358c5880fe50e03a8a7c6dc" }, "downloads": -1, "filename": "simstring_pure-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "7ce680a0c70e226b557827ab395d3ee8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7424, "upload_time": "2018-07-22T07:56:57", "url": "https://files.pythonhosted.org/packages/53/3c/ac84a35621e5dad06bd11e026ddc7d82268eca387bd7865c2708ecd11591/simstring_pure-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a0285e7a24bd166302f9178116434159", "sha256": "a9d9e36d1794a4dc3b9889d4bdbad838ea30979ecdd36bd8a11898fc178eb4b9" }, "downloads": -1, "filename": "simstring-pure-0.0.1.tar.gz", "has_sig": false, "md5_digest": "a0285e7a24bd166302f9178116434159", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3908, "upload_time": "2018-07-22T07:56:59", "url": "https://files.pythonhosted.org/packages/0e/97/1a5c492f036575c18be159ea3062ea53f054637170aa1c96ffccbb111e5f/simstring-pure-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "1664411e85bf08d7fbeb159f0e78e99f", "sha256": "75d182c80d6ca68bd7f82ca12155f0cbb8946c128db3d7264f5df594a6c5bcbd" }, "downloads": -1, "filename": "simstring_pure-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "1664411e85bf08d7fbeb159f0e78e99f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7749, "upload_time": "2018-07-22T08:23:54", "url": "https://files.pythonhosted.org/packages/21/cf/f3796ec1f69509f5fd19fd8ae582c234cd5437fb3b33a65fddc57a732368/simstring_pure-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { 
"md5": "4654cb239af9bd2cfabf117ed664d140", "sha256": "458547d8f7e04b8e75c530db24a351da91fc5dfaf9694599869745b61cac1bf0" }, "downloads": -1, "filename": "simstring-pure-0.0.2.tar.gz", "has_sig": false, "md5_digest": "4654cb239af9bd2cfabf117ed664d140", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4348, "upload_time": "2018-07-22T08:23:56", "url": "https://files.pythonhosted.org/packages/ec/2a/6acc9d6321324b9d08dab03eb9f7970f8a3b0716731ca89129c6818fec69/simstring-pure-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "430227201f0cf2b49e7dcc9a892dc586", "sha256": "4340482e02157cf0f435bc8c6df5d269d110ea7eaf2c54ddfae54916b09bc4e0" }, "downloads": -1, "filename": "simstring_pure-0.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "430227201f0cf2b49e7dcc9a892dc586", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9115, "upload_time": "2018-07-24T01:18:03", "url": "https://files.pythonhosted.org/packages/7b/d8/147ee167fb6dc6aed4e4c1428400ab070bb6d6b7afc9040b004bb0d83d98/simstring_pure-0.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a708a83042d2e75c5c3c6e3679c080d7", "sha256": "c2b857bae4ce08c9e61cf59346d33db30b2aaa9ffab627ed7e935a8577658b55" }, "downloads": -1, "filename": "simstring-pure-0.0.3.tar.gz", "has_sig": false, "md5_digest": "a708a83042d2e75c5c3c6e3679c080d7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6309, "upload_time": "2018-07-24T01:18:04", "url": "https://files.pythonhosted.org/packages/75/e7/c2dcdba7e3945ceb00e094c1258d70e5c82869757ad8c81d875b0b44fd4d/simstring-pure-0.0.3.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "be33e63ff2442f36142b3c88b4667967", "sha256": "ed3dd117159e2e64d2d01813cd8078ea73dff339764b32401bab2ae95d558039" }, "downloads": -1, "filename": "simstring_pure-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "be33e63ff2442f36142b3c88b4667967", 
"packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9853, "upload_time": "2018-07-25T03:16:55", "url": "https://files.pythonhosted.org/packages/dd/9d/0011c93c9c47452dc7c28f3c7e4cbd22dfa8a05ca3c70f75f020e1da5a67/simstring_pure-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fffef1f42db1e6d6eb0d8d86cd443849", "sha256": "0363f917cfcaa5950e16cb0042916ae3eaf398e530979b57f9031eaf93c32f26" }, "downloads": -1, "filename": "simstring-pure-1.0.0.tar.gz", "has_sig": false, "md5_digest": "fffef1f42db1e6d6eb0d8d86cd443849", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7894, "upload_time": "2018-07-25T03:16:57", "url": "https://files.pythonhosted.org/packages/d9/8b/07002566d530d62b1d61d8d027c51963e1ee4d62381705ddf3d522627957/simstring-pure-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "be33e63ff2442f36142b3c88b4667967", "sha256": "ed3dd117159e2e64d2d01813cd8078ea73dff339764b32401bab2ae95d558039" }, "downloads": -1, "filename": "simstring_pure-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "be33e63ff2442f36142b3c88b4667967", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9853, "upload_time": "2018-07-25T03:16:55", "url": "https://files.pythonhosted.org/packages/dd/9d/0011c93c9c47452dc7c28f3c7e4cbd22dfa8a05ca3c70f75f020e1da5a67/simstring_pure-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fffef1f42db1e6d6eb0d8d86cd443849", "sha256": "0363f917cfcaa5950e16cb0042916ae3eaf398e530979b57f9031eaf93c32f26" }, "downloads": -1, "filename": "simstring-pure-1.0.0.tar.gz", "has_sig": false, "md5_digest": "fffef1f42db1e6d6eb0d8d86cd443849", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7894, "upload_time": "2018-07-25T03:16:57", "url": 
"https://files.pythonhosted.org/packages/d9/8b/07002566d530d62b1d61d8d027c51963e1ee4d62381705ddf3d522627957/simstring-pure-1.0.0.tar.gz" } ] }