{ "info": { "author": "Ilia Barahovsky", "author_email": "barahilia@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "*******\nspamspy\n*******\n\n.. image:: https://travis-ci.org/barahilia/spamspy.svg?branch=master\n :target: https://travis-ci.org/barahilia/spamspy\n :alt: Build status\n\nPure Python implementation of *spamsum* (also known from *ssdeep*) with improvements and extensions.\n\nIntroduction\n============\n\nConventional hash and checksum tools like\n`sha256 `_ and\n`CRC32 `_\nare good at grasping **exact** file content.\nThey result in very different hash after any byte or even bit change.\n`Spamsum `_\nwas developed by Andrew Tridgell to compare **similar** files.\nAfter several changes in few parts of a file hash changes only slightly.\nSpamsum comes in two parts: a \"fuzzy\" hash generating similar output for similar files,\nand a slightly adopted edit distance implementation to estimate how similar the hashes\nand their files are.\n\nSpamsum hash length is bound by 64 characters. Block size is chosen at about 1/64 of the\nfile size. Every block is hashed to one character. File division to blocks depends on\nfile content so that on average block size will be like the\ntarget. This allows to get same blocks with same one-character hash even if other parts of the files are different.\n\n* comparable files should be up to two times in size one from another, due to block size\n* for the same reason you can **not** use spamsum to search for some large chunk appearance in files\n* all 64 blocks might appear in the beginning and spamsum hash might \"cover\" only the first part of the file\n* you should consider file format before usage; zip archives to be decompressed before hashing\n* this hash is **not** secure and does **not** garantee similarity; it should be easy enough to generate\n file having specific hash\n\nSpamspy\n=======\n\nIn this project both spamsum components - hash and edit distance, - are implemented in pure Python. As well\na number of improvements are made. Spamsum hash length limitation is dropped, and so both block length and\nthe maximal hash length are configurable and can depend on the file size.\n\nAnother system for library search based on ngrams is introduced. Given a large library of files and one\nnew file we want to check if any large part from the new file appears in the library. For this all files\nare hashed with spamsum with some constant block size. Then hashes are split to ngrams. Library preserves\na mapping from ngram to its original files. Search in library iterates over all ngrams of the new file hash\nand yields the library file(s) with most matches. Assuming block size of 10KB and ngram size of 5, single\nmatch means some chance of about 50KB match in the files themselves.\n\nUsage\n=====\n\nThe package depends on standard modules only. Install with::\n\n pip install spamspy\n\nIn code:\n\n.. code-block:: python\n\n from spamspy.spamsum import spamsum\n s1 = 'some long text' # or open('first.txt').read()\n hash1 = spamsum(s1)\n hash2 = spamsum('somewhat long telegram')\n\n from spamsum.edit_dist import edit_dist\n\n ed = edit_dist(hash1, hash2) # large number means more difference\n\nIn shell:\n\n.. code-block:: sh\n\n spamsum first.txt # 3:uqHRXLAHBn:K2\n edit_dist uqHRXLAHBn uqHRXLAHc # 2 - two changes from first hash to the second\n\n ngram_spy update first.txt # hashes and saves ngrams in ./registry.dat\n ngram_spy search second.txt # (first.txt, 23) - matches on 23 ngrams\n ngram_spy search other.txt # (None, 0) - no matches found\n\nPerformance\n===========\n\nIn Python 2.7 spamsum runs about 600 times slower than the native implementation.\nThe good news is that in PyPy 5.1.1 it is only 15 times slower than the native which should be\ntolerable in many applications. This would be the price for extensions, the new ngrams\nalgorithm and for convenience of in-Python world. If blazing speed is the must, then\nthe new code should be ported back.\n\nLicense\n=======\n\nCopyright (c) 2017 Ilia Barahovsky\n\nThis project is distributed under MIT License.\n\nThe *spamsum* algorithm and tool was developed by Andrew Tridgell as an\nefficient similarity comparison between two files and a spam filter for mail\nclient. It is licensed under the GNU General Public License version 2 or under\nthe terms of the Perl Artistic license. It was copied without modifications from\nhttps://www.samba.org/ftp/unpacked/junkcode/spamsum/ to ``original/`` for\nverification of the Python port.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/barahilia/spamspy", "keywords": "", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "spamspy", "package_url": "https://pypi.org/project/spamspy/", "platform": "", "project_url": "https://pypi.org/project/spamspy/", "project_urls": { "Homepage": "https://github.com/barahilia/spamspy" }, "release_url": "https://pypi.org/project/spamspy/0.1/", "requires_dist": null, "requires_python": "", "summary": "", "version": "0.1" }, "last_serial": 3245858, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "2d0dc17b2fea10fe82cbf581ebdc4608", "sha256": "2cbc8a0d43afa26984a9d1f97074bb764a5be0ac459b15253b4125651031f4e9" }, "downloads": -1, "filename": "spamspy-0.1.tar.gz", "has_sig": false, "md5_digest": "2d0dc17b2fea10fe82cbf581ebdc4608", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6318, "upload_time": "2017-10-12T18:03:54", "url": "https://files.pythonhosted.org/packages/10/1b/a9acbb3d67aaa32e5952f3f64e77402944de1e108334a977d81b5a83294d/spamspy-0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2d0dc17b2fea10fe82cbf581ebdc4608", "sha256": "2cbc8a0d43afa26984a9d1f97074bb764a5be0ac459b15253b4125651031f4e9" }, "downloads": -1, "filename": "spamspy-0.1.tar.gz", "has_sig": false, "md5_digest": "2d0dc17b2fea10fe82cbf581ebdc4608", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6318, "upload_time": "2017-10-12T18:03:54", "url": "https://files.pythonhosted.org/packages/10/1b/a9acbb3d67aaa32e5952f3f64e77402944de1e108334a977d81b5a83294d/spamspy-0.1.tar.gz" } ] }