{ "info": { "author": "Justin Boylan-Toomey", "author_email": "justin.boylan-toomey@outlook.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# SnaPy\n[![Build Status](https://travis-ci.com/justinnbt/SnaPy.svg?branch=master)](https://travis-ci.com/justinnbt/SnaPy)\n[![PyPI version](https://badge.fury.io/py/snapy.svg)](https://badge.fury.io/py/snapy)\n[![Downloads](https://pepy.tech/badge/snapy)](https://pepy.tech/project/snapy)\n[![Python Version](https://img.shields.io/badge/python-3.6%20%7C%203.7-blue.svg)](https://pypi.org/project/snapy/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n
\nPython library for detecting near duplicate texts in a corpus at scale using Locality Sensitive Hashing.
\nAs described in Mining Massive Datasets http://infolab.stanford.edu/~ullman/mmds/ch3.pdf.\n\n## Installation\nInstall SnaPy using: `pip install snapy`
\nInstall the mmh3 library, which MinHash requires, using: `pip install mmh3`\n\n## Quickstart Example\n``` python\nfrom snapy import MinHash, LSH\n\ncontent = [\n 'Jupiter is primarily composed of hydrogen with a quarter of its mass '\n 'being helium',\n 'Jupiter moving out of the inner Solar System would have allowed the '\n 'formation of inner planets.',\n 'A helium atom has about four times as much mass as a hydrogen atom, so '\n 'the composition changes when described as the proportion of mass '\n 'contributed by different atoms.',\n 'Jupiter is primarily composed of hydrogen and a quarter of its mass '\n 'being helium',\n 'A helium atom has about four times as much mass as a hydrogen atom and '\n 'the composition changes when described as a proportion of mass '\n 'contributed by different atoms.',\n 'Theoretical models indicate that if Jupiter had much more mass than it '\n 'does at present, it would shrink.',\n 'This process causes Jupiter to shrink by about 2 cm each year.',\n 'Jupiter is mostly composed of hydrogen with a quarter of its mass '\n 'being helium',\n 'The Great Red Spot is large enough to accommodate Earth within its '\n 'boundaries.'\n]\n\nlabels = [1, 2, 3, 4, 5, 6, 7, 8, 9]\nseed = 3\n\n\n# Create MinHash object.\nminhash = MinHash(content, n_gram=9, permutations=100, hash_bits=64, seed=seed)\n\n\n# Create LSH model.\nlsh = LSH(minhash, labels, no_of_bands=50)\n\n\n# Query to find near duplicates for text 1.\nprint(lsh.query(1, min_jaccard=0.5))\n>>> [8, 4]\n\n\n# Generate minhash signatures for new texts (same parameters and seed) and add them to the LSH model.\nnew_text = [\n 'Jupiter is primarily composed of hydrogen with a quarter of its mass being '\n 'helium',\n 'Jupiter moving out of the inner Solar System would have allowed the '\n 'formation of inner planets.',\n]\n\nnew_labels = ['doc1', 'doc2']\n\nnew_minhash = MinHash(new_text, n_gram=9, permutations=100, hash_bits=64, seed=seed)\n\nlsh.update(new_minhash, new_labels)\n\n\n# Check labels contained in the model.\nprint(lsh.contains())\n>>> [1, 
2, 3, 4, 5, 6, 7, 8, 9, 'doc1', 'doc2']\n\n\n# Remove text and label from model.\nlsh.remove(5)\nprint(lsh.contains())\n>>> [1, 2, 3, 4, 6, 7, 8, 9, 'doc1', 'doc2']\n\n\n# Return adjacency list for all similar texts.\nadjacency_list = lsh.adjacency_list(min_jaccard=0.55)\nprint(adjacency_list)\n>>> {\n 1: ['doc1', 4],\n 2: ['doc2'],\n 3: [],\n 4: [1, 'doc1'],\n 6: [],\n 7: [],\n 8: [],\n 9: [],\n 'doc1': [1, 4],\n 'doc2': [2]\n }\n\n\n# Return edge list for use in creating a weighted graph.\nedge_list = lsh.edge_list(min_jaccard=0.5, jaccard_weighted=True)\nprint(edge_list)\n>>> [\n ('doc2', 2, 1.0),\n ('doc1', 1, 1.0),\n ('doc1', 8, 0.5),\n ('doc1', 4, 0.58),\n (8, 1, 0.5),\n (4, 1, 0.58)\n ]\n\n```\n## API Guide\n\n### MinHash\nCreates a MinHash object that contains a matrix of minhash signatures, one row per text.\n\n#### MinHash Parameters\n```MinHash(text, n_gram=9, n_gram_type='char', permutations=100, hash_bits=64, method='multi_hash', seed=None)```

\ntext: {list or ndarray}
\nIterable containing strings of text for each text in a corpus.

\nn_gram: int, optional, default: 9
\nSize of each overlapping text shingle to break text into prior to hashing. Shingle size should be chosen carefully based on average text length: too small a shingle size will yield false similarities, whereas too large a shingle size will fail to return similar documents.

\nn_gram_type: str, optional, default: 'char'
\nType of n gram to use for shingles, must be 'char' to split text into character shingles or 'term' to split text into overlapping sequences of words.
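To make the difference between the two shingle types concrete, here is a minimal sketch in plain Python. The helper names `char_shingles` and `term_shingles` are hypothetical and are not part of SnaPy; the library's own tokenisation may differ in detail.

```python
def char_shingles(text, n_gram=9):
    # Overlapping character windows of length n_gram.
    return {text[i:i + n_gram] for i in range(len(text) - n_gram + 1)}


def term_shingles(text, n_gram=2):
    # Overlapping windows of n_gram consecutive whitespace-separated words.
    terms = text.split()
    return {' '.join(terms[i:i + n_gram])
            for i in range(len(terms) - n_gram + 1)}


text = 'Jupiter is primarily composed of hydrogen'
print(sorted(char_shingles(text, n_gram=9))[:3])
print(sorted(term_shingles(text, n_gram=2)))
```

A text shorter than the shingle size produces an empty set in this sketch, which is one reason to pick the shingle size with the average text length in mind.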

\npermutations: int, optional, default: 100
\nNumber of randomly sampled hash values to use for generating each text's minhash signature. Intuitively, the larger the number of permutations, the more accurate the estimated Jaccard similarity between texts, but the longer the algorithm will take to run.

\nhash_bits: int, optional, default: 64
\nHash value size to be used to generate minhash signatures from shingles, must be 32, 64 or 128 bit. Hash value size should be chosen based on text length and a trade-off between performance and accuracy. Smaller hash sizes risk hash collisions, leading to false similarities between documents in larger corpora of texts.
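As a rough guide to how hash size affects collision risk, the sketch below uses the standard birthday-bound approximation. It assumes hash values are uniformly distributed, and `collision_probability` is a hypothetical helper, not a SnaPy function.

```python
import math


def collision_probability(n_values, hash_bits):
    # Birthday-bound approximation: the chance that at least two of
    # n_values uniformly random hash_bits-bit hashes are equal.
    exponent = n_values * (n_values - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(-exponent)


# One million distinct shingles hashed at each supported size.
for bits in (32, 64, 128):
    print(bits, collision_probability(10 ** 6, bits))
```

Under these assumptions a collision among a million 32 bit hashes is close to certain, while at 64 or 128 bits it is very unlikely.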

\nmethod: str, optional, default: 'multi_hash'
\nMethod for random sampling via hashing, must be 'multi_hash' or 'k_smallest_values'. If 'multi_hash' is selected, each text is hashed once per permutation and the minimum hash value is selected each time to construct the signature. If 'k_smallest_values' is selected, each text is hashed once and the k smallest values are selected for k permutations. This method is much faster than 'multi_hash' but far less stable.
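The two sampling strategies can be sketched as follows. This is an illustration only: it uses `hashlib.blake2b` from the standard library in place of mmh3, and the function names and the way the seed is turned into a salt are assumptions, not SnaPy's implementation.

```python
import hashlib


def hash_value(shingle, salt=0):
    # Stable 64-bit hash; a different salt gives a different hash function.
    digest = hashlib.blake2b(shingle.encode('utf-8'), digest_size=8,
                             salt=salt.to_bytes(8, 'little')).digest()
    return int.from_bytes(digest, 'little')


def multi_hash_signature(shingles, permutations=100, seed=3):
    # One hash function per permutation; keep the minimum hash each time.
    return [min(hash_value(s, seed + p) for s in shingles)
            for p in range(permutations)]


def k_smallest_signature(shingles, permutations=100, seed=3):
    # A single hash function; keep the k smallest distinct hash values.
    return sorted({hash_value(s, seed) for s in shingles})[:permutations]


shingles = {'jupiter i', 'upiter is', 'piter is ', 'iter is p'}
print(multi_hash_signature(shingles, permutations=5))
print(k_smallest_signature(shingles, permutations=5))
```

In this sketch the k smallest values signature hashes every shingle once rather than once per permutation, which is where the speed difference comes from. It can also come out shorter than requested when a text has fewer shingles than permutations.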

\nseed: int, optional, default: None
\nSeed from which to generate the random hash functions, necessary for reproducibility or to allow the LSH model to be updated with new minhash values later.

\n\n#### MinHash Properties\nn_gram: int
\n```.n_gram```
\nReturns size of each overlapping text shingle used to create minhash signatures.

\nn_gram_type: str
\n```.n_gram_type```
\nReturns type of n-gram used for text shingling.

\npermutations: int
\n```.permutations```
\nReturns number of permutations used to create signatures.

\nhash_bits: int
\n```.hash_bits```
\nReturns hash value size used to create signatures.

\nmethod: str
\n```.method```
\nReturns hashing method used in minhash function.

\nseed: int
\n```.seed```
\nReturns seed value used to generate random hashes in minhash function.

\nsignatures: ndarray
\n```.signatures```
\nReturns matrix of text signatures generated by minhash function.
\nThe matrix has shape (n, m), where n is the number of texts (one row per text) and m is the number of permutations.
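For signatures built with one hash function per permutation, the usual way to estimate Jaccard similarity is the fraction of positions on which two signature rows agree. The sketch below uses plain lists with made-up values standing in for two rows of the signature matrix; `estimated_jaccard` is a hypothetical helper, not part of SnaPy.

```python
def estimated_jaccard(signature_a, signature_b):
    # Fraction of permutations on which the two signatures agree.
    matches = sum(a == b for a, b in zip(signature_a, signature_b))
    return matches / len(signature_a)


# Two hypothetical signature rows with 5 permutations each.
row_1 = [12, 7, 33, 5, 90]
row_2 = [12, 7, 41, 5, 18]
print(estimated_jaccard(row_1, row_2))  # 0.6
```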

\n\n### LSH\nCreates an LSH model of text similarity that can be used to return similar texts based on estimated Jaccard similarity.\n\n#### LSH Parameters\n```LSH(minhash=None, labels=None, no_of_bands=None)```

\nminhash, optional, default: None
\nMinhash object containing minhash signatures returned by MinHash class.

\nlabels: {list or ndarray}, optional, default: None
\nList, array or Pandas series containing unique labels for each text in minhash object signature. This should be provided in the same order as texts passed to the MinHash class. Example labels include filepaths and database ids.

\nno_of_bands: int, optional, default: permutations // 2
\nNumber of bands to break minhash signature into before hashing into buckets. A smaller number of bands results in a stricter algorithm that requires a higher similarity between texts, possibly leading to false negatives that miss some similar texts, whereas a higher number may lead to false similarities.
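The effect of the band count can be quantified with the banding formula from the Mining Massive Datasets chapter referenced above: two texts with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b, where b is the number of bands and r is the number of rows per band. The sketch below assumes r = permutations // no_of_bands; `candidate_probability` is a hypothetical helper, not a SnaPy function.

```python
def candidate_probability(jaccard, permutations=100, no_of_bands=50):
    # Probability that two texts share at least one LSH bucket.
    rows = permutations // no_of_bands
    return 1 - (1 - jaccard ** rows) ** no_of_bands


# Compare band counts for a pair of texts with Jaccard similarity 0.5.
for bands in (10, 20, 50):
    print(bands, round(candidate_probability(0.5, 100, bands), 3))
```

Fewer bands means more rows per band, so a pair needs a higher similarity before it is likely to collide in any bucket.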

\n\n#### LSH Methods\nupdate
\nUpdates model from a MinHash object containing signatures generated from new texts and their corresponding labels.
\n```.update(minhash, new_labels)```
\nminhash: MinHash object containing signatures of new texts, parameters must match any previous MinHash objects.
\nnew_labels: List, array or Pandas series containing text labels.

\nquery
\nTakes a label and returns the labels of any similar texts.
\n```.query(label, min_jaccard=None, sensitivity=1)```
\nlabel: Label of text to return list of similar texts for.
\nmin_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
\nsensitivity: Number of buckets texts must share to be returned as similar.

\nremove
\nRemoves text label and minhash signature from model.
\n```.remove(label)```
\nlabel: Label of text to remove from LSH model.

\ncontains
\nReturns list of labels contained in the model.
\n```.contains()```

\nadjacency_list
\nReturns an adjacency list that can be used to create a text similarity graph.
\n```.adjacency_list(min_jaccard=None, sensitivity=1)```
\nmin_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
\nsensitivity: Number of buckets texts must share to be returned as similar.

\nedge_list
\nReturns a list of edges as tuples of similar pairs, that can be used to create a text similarity graph.
\n```.edge_list(min_jaccard=None, jaccard_weighted=False, sensitivity=1)```
\nmin_jaccard: Jaccard similarity threshold texts have to exceed to be returned as a pair of similar texts.
\njaccard_weighted: If True, returns edges as 3-tuples containing the pair of similar text labels and their estimated Jaccard similarity score.
\nsensitivity: Number of buckets texts must share to be returned as similar.
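One possible use of a weighted edge list is to group near duplicates into clusters. The sketch below builds an undirected weighted graph from the edge list printed in the quickstart example and finds its connected components using only the standard library. The `connected_components` helper is an illustrative assumption, not part of SnaPy.

```python
from collections import defaultdict

# Edge list as printed in the quickstart example.
edge_list = [
    ('doc2', 2, 1.0),
    ('doc1', 1, 1.0),
    ('doc1', 8, 0.5),
    ('doc1', 4, 0.58),
    (8, 1, 0.5),
    (4, 1, 0.58),
]

# Undirected weighted graph stored as a dict of dicts.
graph = defaultdict(dict)
for source, target, weight in edge_list:
    graph[source][target] = weight
    graph[target][source] = weight


def connected_components(graph):
    # Depth-first search that groups labels linked by any chain of edges.
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node])
        seen |= component
        components.append(component)
    return components


print(connected_components(graph))
```

Each component is a set of labels that are linked, directly or through other texts, by at least one edge above the chosen threshold.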

\n\n#### LSH Properties\nno_of_bands: int
\n```.no_of_bands```
\nNumber of bands used in LSH model.

\npermutations: int
\n```.permutations```
\nNumber of permutations used to create minhash signatures used in LSH model.

\n\n## Contributing\nContributions are very welcome, message us or just submit a pull request!\n\n## Authors\nJustin Boylan-Toomey\n\n## License\nMIT License\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/justinnbt/SnaPy", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "snapy", "package_url": "https://pypi.org/project/snapy/", "platform": "", "project_url": "https://pypi.org/project/snapy/", "project_urls": { "Homepage": "https://github.com/justinnbt/SnaPy" }, "release_url": "https://pypi.org/project/snapy/1.0.2/", "requires_dist": [ "numpy" ], "requires_python": "", "summary": "SnaPy is a Python library for detecting near duplicate texts using Minhash and Locality Sensitive Hashing.", "version": "1.0.2" }, "last_serial": 5730004, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "04ada167ca22e445cb968496602e2788", "sha256": "d90edce4fdcba047a3500a19325a0c087e30c959c014d245313b1670376ed08e" }, "downloads": -1, "filename": "snapy-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "04ada167ca22e445cb968496602e2788", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6935, "upload_time": "2019-06-16T11:50:06", "url": "https://files.pythonhosted.org/packages/f7/e6/d4acdc34d5e2e4b5245208acfff394ee6bab5644f199ef1994e1ac8e1b2c/snapy-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0576f22f9d14921ea2cd9154e35b970c", "sha256": "f1136cc9f57e222607673cef708b0c6af3ecf8c55f8e9c7af4efc65bfb7e5973" }, "downloads": -1, "filename": "snapy-0.0.1.tar.gz", "has_sig": false, "md5_digest": "0576f22f9d14921ea2cd9154e35b970c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5369, "upload_time": "2019-06-16T11:50:09", "url": 
"https://files.pythonhosted.org/packages/2c/22/9f42282603498eaea5822f9fee6d7ef5e08d7c1247a3554fc034eaefad40/snapy-0.0.1.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "80d27144498111132537c0c4b8e5fa60", "sha256": "9d53f61fdbfa7ffa64a2b3e41cf12f64ef75de252c3b33bc8c7048013e57253a" }, "downloads": -1, "filename": "snapy-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "80d27144498111132537c0c4b8e5fa60", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9915, "upload_time": "2019-08-26T01:18:46", "url": "https://files.pythonhosted.org/packages/0a/ae/ee899304bb89356d7c1d2638228af417d534e69a4a948b04e6852b6f42a2/snapy-1.0.0-py3-none-any.whl" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "c0ca3b78414471a1c2a10828c6a42644", "sha256": "91dc6c883be8a6f53a8fb5d2c8f88b624e3daab8038dabd2e32b8cc145aba9d0" }, "downloads": -1, "filename": "snapy-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "c0ca3b78414471a1c2a10828c6a42644", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9913, "upload_time": "2019-08-26T09:51:06", "url": "https://files.pythonhosted.org/packages/32/ee/1d9c5fbb990644ac609bf6e012e1aeb6790f09ab242fa99cd7c064d9003c/snapy-1.0.1-py3-none-any.whl" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "f486ca3ba9678476da87bdccba2c7cfd", "sha256": "c33e28d6e6e98b5e56ba0acf38c898f3d21aa160c2d205d81222ef9b378fd2ea" }, "downloads": -1, "filename": "snapy-1.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "f486ca3ba9678476da87bdccba2c7cfd", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9921, "upload_time": "2019-08-26T10:07:31", "url": "https://files.pythonhosted.org/packages/68/59/cdf153f7a391593060a3fcd06f13f8127dc79f675d99ef576b69f49365a0/snapy-1.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "23cb7c5bc023ad8e660dd7bd0ef9f86d", "sha256": 
"d318641933689fdfccf5c9ba9c932ccf45a0157b07719d6b3ad892f7f893dbfa" }, "downloads": -1, "filename": "snapy-1.0.2.tar.gz", "has_sig": false, "md5_digest": "23cb7c5bc023ad8e660dd7bd0ef9f86d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11516, "upload_time": "2019-08-26T10:07:32", "url": "https://files.pythonhosted.org/packages/67/1c/cd2f7ced008ad664c84dfd904b56d9f3c9b1b08e505df9b7e0dc5f61cf0d/snapy-1.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f486ca3ba9678476da87bdccba2c7cfd", "sha256": "c33e28d6e6e98b5e56ba0acf38c898f3d21aa160c2d205d81222ef9b378fd2ea" }, "downloads": -1, "filename": "snapy-1.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "f486ca3ba9678476da87bdccba2c7cfd", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9921, "upload_time": "2019-08-26T10:07:31", "url": "https://files.pythonhosted.org/packages/68/59/cdf153f7a391593060a3fcd06f13f8127dc79f675d99ef576b69f49365a0/snapy-1.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "23cb7c5bc023ad8e660dd7bd0ef9f86d", "sha256": "d318641933689fdfccf5c9ba9c932ccf45a0157b07719d6b3ad892f7f893dbfa" }, "downloads": -1, "filename": "snapy-1.0.2.tar.gz", "has_sig": false, "md5_digest": "23cb7c5bc023ad8e660dd7bd0ef9f86d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11516, "upload_time": "2019-08-26T10:07:32", "url": "https://files.pythonhosted.org/packages/67/1c/cd2f7ced008ad664c84dfd904b56d9f3c9b1b08e505df9b7e0dc5f61cf0d/snapy-1.0.2.tar.gz" } ] }