{ "info": { "author": "Eric Zhu", "author_email": "ekzhu@cs.toronto.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# Set Similarity Search\n\n[![Build Status](https://travis-ci.org/ekzhu/SetSimilaritySearch.svg?branch=master)](https://travis-ci.org/ekzhu/SetSimilaritySearch)\n\nEfficient set similarity search algorithms in Python. \nFor even better performance see the\n[Go Implementation](https://github.com/ekzhu/go-set-similarity-search).\n\n## What is set similarity search?\n\nLet's say we have a database of users and the books they have read.\nAssume that we want to recommend \"friends\" for each user,\nand the \"friends\" must have read very similar set of books\nas the user have. We can model this as a set similarity search problem,\nby representing each user's books as a set:\n\n```\nAlice: {\"Anna Karenina\", \"War and Peace\", \"The Chameleon\", ...}\nBob: {\"Lolita\", \"The Metamorphosis\", \"The Judgement\", ...}\nJoey: {\"Anna Karenina\", \"The Chameleon\" ...}\n```\n\nA popular way to measure the similarity between two sets is \n[Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index), which\ngives a fractional score between 0 and 1.0. \n\nThere are two versions of set similarity search problem, \nboth can be defined given a collection of sets, a \nsimilarity function and a threshold:\n\n1. *All-Pairs:* find all pairs of sets that have \nsimilarities greater than (or equal to) the threshold;\n2. *Query:* given a query set, from the collection of sets, find all that\nhave similarities greater than (or equal to) the threshold with respect to\nthe query set.\n\nBoth versions of the problem can be very computationally expensive \nas the collection can be large and the set sizes can be large. \nThe simple brute-force algorithm is O(n^2) for (1) and O(n) for (2).\n\nThis package includes a Python implementation of the \"All-Pair-Binary\" \nalgorithm in\n[Scaling Up All Pairs Similarity Search](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/32781.pdf)\npaper, with additional position filter optimization. \nThis algorithm still has the same worst-case complexity as the brute-force \nalgorithm, however, by taking advantage of skewness in empirical\ndistributions of set sizes and frequencies, it often runs much faster\n(even better than [MinHash LSH](https://ekzhu.github.io/datasketch/lsh.html)).\n\n## Benchmarks\n\nRun *All-Pairs* on 3.5 GHz Intel Core i7, using similarity function `jaccard` \nand similarity threshold 0.5. \nThe running time of [`datasketch.MinHashLSH`](https://ekzhu.github.io/datasketch/lsh.html) is also shown below for \ncomparison (`num_perm=32`).\n\n| Dataset | Input Sets | Avg. Size | `SetSimilaritySearch` Runtime | `datasketch` Runtime | `datasketch` Accuracy |\n|---------|--------------|--------------|---------|------|--|\n| [Pokec social network (relationships)](https://snap.stanford.edu/data/soc-Pokec.html): from-nodes are set IDs; to-nodes are elements | 1432693 | 27.31 | 10m49s | 11m4s | Precision: 0.73; Recall: 0.67 |\n| [LiveJournal](https://snap.stanford.edu/data/soc-LiveJournal1.html): from-nodes are set IDs; to-nodes are elements | 4308452 | 16.01 | 28m51s | 31m58s | Precision: 0.79; Recall: 0.74|\n\nAlthough `datasketch.MinHashLSH` is an approximate algorithm, and I am using `num_perm=32` which is quite low, it is still \na bit slower than the exact algorithm `SetSimilaritySearch`. \nThe time for\ncreating `datasketch.MinHash` is also included in the end-to-end time, while\nin practice this time can be saved through pre-computation. However, for \n*ad hoc* computation of *All-Pairs*, `SetSimilaritySearch` is still\nthe better choice, especially when sets are small and fit in memory.\n\nRun *Query* on 3.5 GHz Intel Core i7, using similarity function `jaccard` \nand similarity threshold 0.5. \nThe query sets are sampled from the dataset itself.\nThe running time of [`datasketch.MinHashLSH`](https://ekzhu.github.io/datasketch/lsh.html) is also shown below for \ncomparison (`num_perm=32`).\n\n| Dataset | Indexed Sets | Query Sets | Avg. Size | `SetSimilaritySearch` Indexing & Querying Time | `datasketch` Indexing & Querying Time | `datasketch` Accuracy |\n|--|--|--|--|--|--|--|\n| [Pokec social network (relationships)](https://snap.stanford.edu/data/soc-Pokec.html): from-nodes are set IDs; to-nodes are elements | 1432693 | 10k | 27.31 | Indexing: 1m7s; Querying (90pct): 2.3ms | Indexing: 9m23s; Querying (90pct): 0.72ms | Precision: 0.90; Recall: 0.88 |\n| [LiveJournal](https://snap.stanford.edu/data/soc-LiveJournal1.html): from-nodes are set IDs; to-nodes are elements | 4308452 | 10k | 16.01 | Indexing: 2m32s; Querying (90pct): 1.6ms | Indexing: 30m58s; Querying (90pct): 2.1ms | Precision: 0.85; Recall: 0.78|\n\nThe indexing time for `datasketch.MinHashLSH`, including the time for \ncreating `datasketch.MinHash`, is much worse than `SetSimilaritySearch` --\nnearly 10x and 15x. Therefore `SetSimilaritySearch` is much better for \n*ad hoc* computation of the *Query* problem. For the scenario in which the same \nsearch index is reused for many *Query* problems, `datasketch.MinHashLSH` is \nfaster than `SetSimilaritySearch` when the set sizes are large. This is \neasy to understand: the size of `datasketch.MinHash` is constant, wheres \na set can be arbitrarily large, so the query time for large sets is faster\nwhen sketch is used. However, when the set sizes become smaller, the sketch \nlooses its advantage.\n\n## Install\n\n```bash\npip install -U SetSimilaritySearch\n```\n\n## Library usage\n\nFor *All-Pairs*, it takes an input of a list of sets, and output pairs that\nmeet the similarity threshold.\n\n```python\nfrom SetSimilaritySearch import all_pairs\n\n# The input sets must be a Python list of iterables (i.e., lists or sets).\nsets = [[1,2,3], [3,4,5], [2,3,4], [5,6,7]]\n# all_pairs returns an iterable of tuples.\npairs = all_pairs(sets, similarity_func_name=\"jaccard\", \n similarity_threshold=0.1)\nlist(pairs)\n# [(1, 0, 0.2), (2, 0, 0.5), (2, 1, 0.5), (3, 1, 0.2)]\n# Each tuple is (, , ).\n# The indexes are the list indexes of the input sets.\n```\n\nFor *Query*, it takes an input of a list of sets, and builds a search index\nthat can compute any number of queries. Currently the search index only \nsupports a static collection of sets with no updates.\n\n```python\nfrom SetSimilaritySearch import SearchIndex\n\n# The input sets must be a Python list of iterables (i.e., lists or sets).\nsets = [[1,2,3], [3,4,5], [2,3,4], [5,6,7]]\n# The search index cannot be updated.\nindex = SearchIndex(sets, similarity_func_name=\"jaccard\", \n similarity_threshold=0.1)\n# The query function takes input a set.\nresults = index.query([5,3,4])\nresults\n# [(1, 1.0), (0, 0.2), (2, 0.5), (3, 0.2)]\n# Each tuple is (, ).\n# The index is the list index of the sets in the search index.\n```\n\n\nSupported similarity functions (more to come):\n* [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index): intersection size divided by union size; set `similarity_func_name=\"jaccard\"`.\n* [Cosine](https://en.wikipedia.org/wiki/Cosine_similarity): intersection size divided by square root of the product of sizes; set `similarity_func_name=\"cosine\"`.\n* [Containment](https://ekzhu.github.io/datasketch/lshensemble.html#containment): intersection size divided by the size of the first set (or query set); set `similarity_func_name=\"containment\"`.\n\n\n## Command line usage\n\nYou can also use the command line program `all_pairs.py`.\nThe input must be **one or two** files with each line a **unique** `SetID Token` \ntuple. \nFor example:\n```\n# Line starts with # will be ignored.\n# Each line is , separate by a whitespace or tab.\n# Every line must be unique.\n1 a\n1 b\n1 c\n1 d\n2 a\n2 b\n2 c\n3 d\n3 e\n```\nWhen one input file is given, it computes *All-Pairs*; when two input files \nare given, it computes *Query* by building a search index on the first\ncollection and querying with sets from the second collection -- effectively\ncomputes cross-collection pairs.\n\nExample usage (*All-Pairs*):\n```bash\nall_pairs.py --input-sets testdata/example_input.txt \\\n --output-pairs testdata/example_output.txt \\\n --similarity-func jaccard \\\n --similarity-threshold 0.1\n```\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ekzhu/SetSimilaritySearch", "keywords": "set similarity search all pairs", "license": "", "maintainer": "", "maintainer_email": "", "name": "SetSimilaritySearch", "package_url": "https://pypi.org/project/SetSimilaritySearch/", "platform": "", "project_url": "https://pypi.org/project/SetSimilaritySearch/", "project_urls": { "Homepage": "https://github.com/ekzhu/SetSimilaritySearch" }, "release_url": "https://pypi.org/project/SetSimilaritySearch/0.1.7/", "requires_dist": [ "numpy", "coverage ; extra == 'test'", "nose ; extra == 'test'" ], "requires_python": "", "summary": "A Python library of set similarity search algorithms", "version": "0.1.7" }, "last_serial": 4982922, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "383737ce1d2caf36472dda42f0b284c7", "sha256": "9b6d7e7b03c791a471d917848350b615d318a7fe75dba3539949d540426f65cb" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "383737ce1d2caf36472dda42f0b284c7", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6136, "upload_time": "2018-11-18T00:18:00", "url": "https://files.pythonhosted.org/packages/59/05/e18b4b64033b58759e878cda40b035f04fde7a9d578ba36b49d251098a7e/SetSimilaritySearch-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0642f776c65c8aed94897ad7e163685d", "sha256": "8e9b8171b1ed4c7a54f83e99d6174e477fca78955aff1f90bf4d78684be5477a" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.1.tar.gz", "has_sig": false, "md5_digest": "0642f776c65c8aed94897ad7e163685d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8058, "upload_time": "2018-11-18T00:18:02", "url": "https://files.pythonhosted.org/packages/eb/f3/00eba67c6201a6fcb24959b2978ee257ec7179d9d8ef0695081d56e08322/SetSimilaritySearch-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "d37f3bf04367d15f40c25180ec00cf5f", "sha256": "984127530164b47fea690641280b1ccc1c4ca085b4dca03f8ba31ecdf288220f" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d37f3bf04367d15f40c25180ec00cf5f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10272, "upload_time": "2018-11-18T00:25:39", "url": "https://files.pythonhosted.org/packages/f9/97/487360a8149ae644361c146efbf0f811ac28d0c58ff33293ce0e4e769847/SetSimilaritySearch-0.1.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "660d7f4ca9dab91dba8f7c2ae286ea62", "sha256": "683dad4e0fa2124685c35a322eae52bc51bd45e3d498436ace0af232aaf7f640" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.2.tar.gz", "has_sig": false, "md5_digest": "660d7f4ca9dab91dba8f7c2ae286ea62", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8045, "upload_time": "2018-11-18T00:25:40", "url": "https://files.pythonhosted.org/packages/2b/45/81d461ccf5137ba043ecfc57e249ef5fedf9780c5d0526d0ff508364f5a1/SetSimilaritySearch-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "85d6aef047ad3eb0a14a397ed8eb2896", "sha256": "73309840cb478d291b4a7fd883668ab80e5fc17ae46a6977d4523b9ccc451072" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "85d6aef047ad3eb0a14a397ed8eb2896", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 13962, "upload_time": "2018-11-18T20:34:53", "url": "https://files.pythonhosted.org/packages/ea/ea/143a6ffd5f2f541b61a3b873e0f36398da459a75e0ed05a9e13d16fd789b/SetSimilaritySearch-0.1.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2060aebcd06cf89ccc8cd70a1dd593e6", "sha256": "e62016fccf3bd77df85a824bb75051423e765b191b4bbcd70d680a3ec48bb021" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.3.tar.gz", "has_sig": false, "md5_digest": "2060aebcd06cf89ccc8cd70a1dd593e6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10790, "upload_time": "2018-11-18T20:34:54", "url": "https://files.pythonhosted.org/packages/7d/27/05e806557b84c1d969ed200e87e77def554177a14ab682c478a999468ae4/SetSimilaritySearch-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "947d79ff6575fbdfe587fb0be17e4202", "sha256": "59e364d23d431b1dc155ad6c7d720b9b7c6b89868210f316b4a5e7a60d97ac47" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "947d79ff6575fbdfe587fb0be17e4202", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 14036, "upload_time": "2018-11-26T15:40:43", "url": "https://files.pythonhosted.org/packages/4d/b7/db3c5218677a041ba3625cd6826c004e722b1d4b492977a9711c45c58444/SetSimilaritySearch-0.1.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e9c6276d3371169e677effde0843b080", "sha256": "32de8d800455db14da60717ab1ae46310115633b4e49c3da4d93b0e0ea7e35ad" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.4.tar.gz", "has_sig": false, "md5_digest": "e9c6276d3371169e677effde0843b080", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10709, "upload_time": "2018-11-26T15:40:45", "url": "https://files.pythonhosted.org/packages/ef/3e/179fe98d3c1e905a5fddde0a50185e013fca4c25884e23f846bf1dd4c75d/SetSimilaritySearch-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "691b3f134ec82a22a4db193a19f87a64", "sha256": "109f06f358fd36c311f972a9ffd4777960233e2b999fde36f420fd90d34c0540" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "691b3f134ec82a22a4db193a19f87a64", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 9968, "upload_time": "2018-12-19T05:44:21", "url": "https://files.pythonhosted.org/packages/61/c8/0cd5e8f790f7fa86fa85c94f3be47d67d3ce2af18f792a178ad3b23823a3/SetSimilaritySearch-0.1.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "558e8619aae0b93e4b59f19aebdf7ec1", "sha256": "648248f0c067986d701b34a0ef892e2197dfa5f618e436b6bb8154593d8a514e" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.5.tar.gz", "has_sig": false, "md5_digest": "558e8619aae0b93e4b59f19aebdf7ec1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11632, "upload_time": "2018-12-19T05:44:22", "url": "https://files.pythonhosted.org/packages/f3/ec/d399bf51fd7498a8aea2f72e62fabc2986f7159bcca1ed6b90eb14eff2b5/SetSimilaritySearch-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "d23fa2beedc8e02940786e801f69d326", "sha256": "6309353c6abdd45a2e4ab57d9624966717d9c948235376ac99249ae6b77c9753" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d23fa2beedc8e02940786e801f69d326", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10143, "upload_time": "2018-12-19T16:34:37", "url": "https://files.pythonhosted.org/packages/6f/ec/7737cc802437087d08256bf8ba1e09a4f356fdb4707bcc486181c4628870/SetSimilaritySearch-0.1.6-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "25efff96da11f83afafd8fabf3630213", "sha256": "4cee7391bf29a4c5fec741bd6cbfe089d89cfe9393d43f4205cfc67aa430e5c9" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.6.tar.gz", "has_sig": false, "md5_digest": "25efff96da11f83afafd8fabf3630213", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11811, "upload_time": "2018-12-19T16:34:38", "url": "https://files.pythonhosted.org/packages/e7/de/b466a5bc257792947878da2151b8c26f565a787e6c85daace47ab652ca19/SetSimilaritySearch-0.1.6.tar.gz" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "1725a7b8261956f206a5037174a8d140", "sha256": "4d61b5ee5635276054e651070483fe2342786c3e6424cfb6734634afd893d5cf" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.7-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1725a7b8261956f206a5037174a8d140", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 14504, "upload_time": "2019-03-25T15:03:51", "url": "https://files.pythonhosted.org/packages/a8/02/5de5172cb7b990f85986ef405129c55b20b0798359f0879d0c649d112cdd/SetSimilaritySearch-0.1.7-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "506e6aa215c1c4c2f29b01a51bc77de4", "sha256": "5d95812e6237b877adbd991c14583e9191925f2809ed58aa1e9f34e9c8420722" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.7.tar.gz", "has_sig": false, "md5_digest": "506e6aa215c1c4c2f29b01a51bc77de4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11300, "upload_time": "2019-03-25T15:03:52", "url": "https://files.pythonhosted.org/packages/b7/80/86b61d248f6d96f3625c8c37d78fa1b3f019cbcfc98ef5b5b412a69777d5/SetSimilaritySearch-0.1.7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1725a7b8261956f206a5037174a8d140", "sha256": "4d61b5ee5635276054e651070483fe2342786c3e6424cfb6734634afd893d5cf" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.7-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1725a7b8261956f206a5037174a8d140", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 14504, "upload_time": "2019-03-25T15:03:51", "url": "https://files.pythonhosted.org/packages/a8/02/5de5172cb7b990f85986ef405129c55b20b0798359f0879d0c649d112cdd/SetSimilaritySearch-0.1.7-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "506e6aa215c1c4c2f29b01a51bc77de4", "sha256": "5d95812e6237b877adbd991c14583e9191925f2809ed58aa1e9f34e9c8420722" }, "downloads": -1, "filename": "SetSimilaritySearch-0.1.7.tar.gz", "has_sig": false, "md5_digest": "506e6aa215c1c4c2f29b01a51bc77de4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11300, "upload_time": "2019-03-25T15:03:52", "url": "https://files.pythonhosted.org/packages/b7/80/86b61d248f6d96f3625c8c37d78fa1b3f019cbcfc98ef5b5b412a69777d5/SetSimilaritySearch-0.1.7.tar.gz" } ] }