{ "info": { "author": "Chuancong Gao", "author_email": "chuancong@gmail.com", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3" ], "description": "[![PyPI version](https://img.shields.io/pypi/v/TopSim.svg)](https://pypi.python.org/pypi/TopSim/)\n[![PyPI pyversions](https://img.shields.io/pypi/pyversions/TopSim.svg)](https://pypi.python.org/pypi/TopSim/)\n[![PyPI license](https://img.shields.io/pypi/l/TopSim.svg)](https://pypi.python.org/pypi/TopSim/)\n\nSearch the most similar strings against the query in Python 3. State-of-the-art algorithm and data structure are adopted for best efficiency. For both flexibility and efficiency, only set-based similarities are supported right now, including [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index) and [Tversky](https://en.wikipedia.org/wiki/Tversky_index).\n\n# Installation\n\nThis package is available on PyPI. Just use `pip3 install -U TopSim` to install it.\n\n# CLI Usage\n\nYou can simply use the algorithm on terminal.\n\n```\nUsage:\n topsim-cli [options] []\n\n\nOptions:\n -I Case-sensitive matching.\n -k Maximum number of search results. [default: 1]\n --tie Include all the results with the same similarity of the \"k\"-th result. May return more than \"k\" results.\n\n -s, --search Search the query within each line rather than against the whole line, by preferring partial matching of the line.\n Tversky similarity is used instead of Jaccard similarity.\n -e Parameter for \"tversky\" similarity. [default: 0.001]\n\n --mapping= Map each string to a set of either \"gram\"s or \"word\"s. [default: gram]\n --numgrams= Number of characters for each gram when mapping by \"gram\". [default: 2]\n\n --quiet Do not print additional information to standard error.\n```\n\n* The query is matched against each line of the input file (or standard input).\n\n- Each line and its similarity are separated by tab character `\\t`.\n\n# API Usage\n\nAlternatively, you can use the algorithm via API.\n\n``` python\nfrom topsim import TopSim\n\nts = TopSim([\n \"python2\",\n \"python2.7\",\n \"python3\",\n \"python3.6\",\n])\n\nprint(ts.search(\"python\", k=3)) # Return each similarity and the respective line numbers.\n```\n\n* Please check `topsim.py` for more optional parameters, like similarity function, etc.\n\n# Examples\n\n* Search the most similar line.\n\n`ls /usr/bin | topsim-cli \"top\"`\n\n```\ntop\t1.0\n```\n\n* Search the three most similar lines.\n\n`ls /usr/bin | topsim-cli \"top\" -k 3`\n\n```\ntop\t1.0\ntops\t0.5\niotop\t0.4286\n```\n\n* Use Jaccard similarity in default, which puts same weight on matching both the query and the lines.\n\n`ls /usr/bin | topsim-cli \"git\" -k 5`\n\n```\ngit\t1.0\nwait\t0.2857\ngit-shell\t0.2727\npluginkit\t0.2727\nkinit\t0.25\n```\n\n* Use Tversky similarity, which puts most weight on matching the query. Ideal when searching within long lines.\n\n`ls /usr/bin | topsim-cli \"git\" -k 5 -s`\n\n```\ngit\t1.0\ngit-shell\t0.7489\npluginkit\t0.7489\ngit-cvsserver\t0.7481\ngit-upload-pack\t0.7478\n```\n\n- For `n`-gram mapping, higher number of `n` for can result in better accuracy but fewer matches.\n\n`ls /usr/bin | topsim-cli \"git\" -k 5 -s --numgrams=3`\n\n```\ngit\t1.0\ngit-shell\t0.5993\ngit-cvsserver\t0.5988\ngit-upload-pack\t0.5986\ngit-receive-pack\t0.5984\n```\n\n- Full support of Chinese/Japanese/Korean.\n\n`cat test`\n\n``` text\n\u5730\u4e09\u9c9c\n\u7ea2\u70e7\u8089\n\u70e4\u5168\u725b\n\u6728\u987b\u8089\n\u571f\u8c46\u7096\u725b\u8089\n```\n\n`cat test | topsim-cli \"\u725b\u8089\" -k 3 -s`\n\n``` text\n\u571f\u8c46\u7096\u725b\u8089\t0.666\n\u7ea2\u70e7\u8089\t0.3332\n\u6728\u987b\u8089\t0.3332\n```\n\n# Tip\nI strongly encourage using PyPy instead of CPython to run the script for best performance.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/chuanconggao/TopSim/tarball/0.1.5", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/chuanconggao/TopSim", "keywords": "similarity-search,string-search", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "TopSim", "package_url": "https://pypi.org/project/TopSim/", "platform": "", "project_url": "https://pypi.org/project/TopSim/", "project_urls": { "Download": "https://github.com/chuanconggao/TopSim/tarball/0.1.5", "Homepage": "https://github.com/chuanconggao/TopSim" }, "release_url": "https://pypi.org/project/TopSim/0.1.5/", "requires_dist": null, "requires_python": ">= 3", "summary": "Search the most similar strings against the query in Python 3.", "version": "0.1.5" }, "last_serial": 3837612, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "bf9c2dc7907167a8ad63ed5f743357b3", "sha256": "848a4909658b806ad014e9519ecd96c6fa73ca8612cdd77502fe06d7ef6b461a" }, "downloads": -1, "filename": "TopSim-0.1.tar.gz", "has_sig": false, "md5_digest": "bf9c2dc7907167a8ad63ed5f743357b3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5128, "upload_time": "2018-04-11T02:40:43", "url": "https://files.pythonhosted.org/packages/c5/05/6c26e4d80bf169b8a4411b3b8d66b99316617fcc8b8ce33957d727aeac02/TopSim-0.1.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "586b9d5f8927b5ac7c72f8803ddfa694", "sha256": "fba5ae1aaaa312d22c3713fc6cbd338247286831d09bf1f863ab76532d45a1ac" }, "downloads": -1, "filename": "TopSim-0.1.3.tar.gz", "has_sig": false, "md5_digest": "586b9d5f8927b5ac7c72f8803ddfa694", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 5589, "upload_time": "2018-04-30T19:50:41", "url": "https://files.pythonhosted.org/packages/1b/b3/101383233ee7d4bc7aa154fdfd30c1e672d91d8c58d61a1a5ccb17e0b311/TopSim-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "1776680b5784e03eec1fb4d129768db5", "sha256": "fc8ec330a1d533d20c0122c30e8b7935a97da7aa4626052e830f91b74b18f1c1" }, "downloads": -1, "filename": "TopSim-0.1.4.tar.gz", "has_sig": false, "md5_digest": "1776680b5784e03eec1fb4d129768db5", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 5755, "upload_time": "2018-05-01T10:01:57", "url": "https://files.pythonhosted.org/packages/73/99/9e03b13b81dfb0b10c7f2c1541aa03ea7e9ec875ec795ce60b7d4c671cab/TopSim-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "5cf765a6cd6faa5c12a8191ff4ff160b", "sha256": "954160f58cb05d0d45b47b67862ebecbf6ab8d5dfa00987f37da7eba08fceadc" }, "downloads": -1, "filename": "TopSim-0.1.5.tar.gz", "has_sig": false, "md5_digest": "5cf765a6cd6faa5c12a8191ff4ff160b", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 5659, "upload_time": "2018-05-05T21:54:35", "url": "https://files.pythonhosted.org/packages/70/cf/2e6e2391dc17ab39f9335e83b6ff83ce9831e70f4d99fd6e8b55f54a254b/TopSim-0.1.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5cf765a6cd6faa5c12a8191ff4ff160b", "sha256": "954160f58cb05d0d45b47b67862ebecbf6ab8d5dfa00987f37da7eba08fceadc" }, "downloads": -1, "filename": "TopSim-0.1.5.tar.gz", "has_sig": false, "md5_digest": "5cf765a6cd6faa5c12a8191ff4ff160b", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 5659, "upload_time": "2018-05-05T21:54:35", "url": "https://files.pythonhosted.org/packages/70/cf/2e6e2391dc17ab39f9335e83b6ff83ce9831e70f4d99fd6e8b55f54a254b/TopSim-0.1.5.tar.gz" } ] }