{ "info": { "author": "Dominik Leibenger", "author_email": "python-fastchunking@mails.dominik-leibenger.de", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "===========================\nfastchunking Python library\n===========================\n\n.. image:: https://travis-ci.org/netleibi/fastchunking.svg?branch=master\n :target: https://travis-ci.org/netleibi/fastchunking\n\n.. image:: https://badge.fury.io/py/fastchunking.svg\n :target: https://badge.fury.io/py/fastchunking\n\n.. image:: https://readthedocs.org/projects/fastchunking/badge/?version=latest\n :target: http://fastchunking.readthedocs.io/en/latest/?badge=latest\n :alt: Documentation Status\n\nWhat it is\n----------\n\n`fastchunking` is a Python library that contains efficient and easy-to-use\nimplementations of string chunking algorithms.\n\nIt has been developed as part of the work [LS16]_ at CISPA, Saarland University.\n\nInstallation\n------------\n\n::\n\n $ pip install fastchunking\n\n.. note:: For performance reasons, parts of this library are implemented in C++.\n\tInstallation from a source distribution, thus, requires availability of a\n\tcorrectly configured C++ compiler.\n\nUsage and Overview\n------------------\n\n`fastchunking` provides efficient implementations for different string chunking\nalgorithms, e.g., static chunking (SC) and content-defined chunking (CDC).\n\nStatic Chunking (SC)\n^^^^^^^^^^^^^^^^^^^^\n\nStatic chunking splits a message into fixed-size chunks.\n\nLet us consider a random example message that shall be chunked:\n >>> import os\n >>> message = os.urandom(1024*1024)\n\nStatic chunking is trivial when chunking a single message:\n >>> import fastchunking\n >>> sc = fastchunking.SC()\n >>> chunker = sc.create_chunker(chunk_size=4096)\n >>> chunker.next_chunk_boundaries(message)\n [4096, 8192, 12288, ...]\n\nA large message can also be chunked in fragments, though:\n >>> chunker = sc.create_chunker(chunk_size=4096)\n >>> chunker.next_chunk_boundaries(message[:10240])\n [4096, 8192]\n >>> chunker.next_chunk_boundaries(message[10240:])\n [2048, 6144, 10240, ...]\n\nContent-Defined Chunking (CDC)\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n`fastchunking` supports content-defined chunking, i.e., chunking of messages\ninto fragments of variable lengths.\n\nCurrently, a chunking strategy based on Rabin-Karp rolling hashes is supported.\n\nAs a rolling hash computation on plain-Python strings is incredibly slow with\nany interpreter, most of the computation is performed by a C++ extension which\nis based on the `ngramhashing` library by Daniel Lemire, see:\nhttps://github.com/lemire/rollinghashcpp\n\nLet us consider a random message that should be chunked:\n >>> import os\n >>> message = os.urandom(1024*1024)\n\nWhen using static chunking, we have to specify a rolling hash window size (here:\n48 bytes) and an optional seed value that affects the pseudo-random distribution\nof the generated chunk boundaries.\n\nDespite that, usage is similar to static chunking:\n >>> import fastchunking\n >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)\n >>> chunker = cdc.create_chunker(chunk_size=4096)\n >>> chunker.next_chunk_boundaries(message)\n [7475, 10451, 12253, 13880, 15329, 19808, ...]\n \nChunking in fragments is straightforward:\n >>> chunker = cdc.create_chunker(chunk_size=4096)\n >>> chunker.next_chunk_boundaries(message[:10240])\n [7475]\n >>> chunker.next_chunk_boundaries(message[10240:])\n [211, 2013, 3640, 5089, 9568, ...]\n\nMulti-Level Chunking (ML-\\*)\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nMultiple chunkers of the same type (but with different chunk sizes) can be\nefficiently used in parallel, e.g., to perform multi-level chunking [LS16]_.\n\nAgain, let us consider a random message that should be chunked:\n >>> import os\n >>> message = os.urandom(1024*1024)\n\nUsage of multi-level-chunking, e.g., ML-CDC, is easy:\n >>> import fastchunking\n >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)\n >>> chunk_sizes = [1024, 2048, 4096]\n >>> chunker = cdc.create_multilevel_chunker(chunk_sizes)\n >>> chunker.next_chunk_boundaries_with_levels(message)\n [(1049, 2), (1511, 1), (1893, 2), (2880, 1), (2886, 0),\n (3701, 0), (4617, 0), (5809, 2), (5843, 0), ...]\n\nThe second value in each tuple indicates the highest chunk size that leads to\na boundary. Here, the first boundary is a boundary created by the chunker with\nindex 2, i.e., the chunker with 4096 bytes target chunk size.\n\n.. note::\n Only the highest index is output if multiple chunkers yield the same\n boundary.\n \n.. warning::\n Chunk sizes have to be passed in correct order, i.e., from lowest to highest\n value.\n\nPerformance\n-----------\n\nComputation costs for `static chunking` are barely measurable: As chunking does\nnot depend on the actual message but only its length, computation costs are\nessentially limited to a single :code:`xrange` call.\n\n`Content-defined chunking`, however, is expensive: The algorithm has to compute\nhash values for rolling hash window contents at `every` byte position of the\nmessage that is to be chunked. To minimize costs, fastchunking works as follows:\n \n 1. The message (fragment) is passed in its entirety to the C++ extension.\n 2. Chunking is performed within the C++ extension.\n 3. The resulting list of chunk boundaries is communicated back to Python and\n converted into a Python list.\n\nBased on a 100 MiB random content, the author measured the following throughput\non an Intel Core i7-4770K in a single, non-representative test run using\nPython 3.5 (Windows x86-64):\n\n =========== ==========\n chunk size throughput\n =========== ==========\n 64 bytes 118 MiB/s\n 128 bytes 153 MiB/s\n 256 bytes 187 MiB/s\n 512 bytes 206 MiB/s\n 1024 bytes 221 MiB/s\n 2048 bytes 226 MiB/s\n 4096 bytes 231 MiB/s\n 8192 bytes 234 MiB/s\n 16384 bytes 233 MiB/s\n 32768 bytes 234 MiB/s\n =========== ==========\n\nTesting\n-------\n\n`fastchunking` uses tox for testing, so simply run:\n\n::\n\n\t$ tox\n\nReferences:\n .. [LS16] Dominik Leibenger and Christoph Sorge (2016). sec-cs: Getting the\n Most out of Untrusted Cloud Storage.\n `arXiv:1606.03368 `_", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/netleibi/fastchunking", "keywords": "text chunking,SC,static chunking,CDC,content-defined chunking,ML-*,multi-level chunking,ML-SC,ML-CDC,Rabin Karp,rolling hash", "license": "Apache Software License", "maintainer": "", "maintainer_email": "", "name": "fastchunking", "package_url": "https://pypi.org/project/fastchunking/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/fastchunking/", "project_urls": { "Homepage": "https://github.com/netleibi/fastchunking" }, "release_url": "https://pypi.org/project/fastchunking/0.0.3/", "requires_dist": null, "requires_python": "", "summary": "Fast chunking library.", "version": "0.0.3" }, "last_serial": 2641300, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "34d0c57a3bf39e7cc0213781917462c5", "sha256": "e08825537f962471c53ebd4453612e4d80c29906aaae209d5ffdb083e0abb5ec" }, "downloads": -1, "filename": "fastchunking-0.0.1.zip", "has_sig": false, "md5_digest": "34d0c57a3bf39e7cc0213781917462c5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30314, "upload_time": "2016-06-10T16:16:54", "url": "https://files.pythonhosted.org/packages/5f/8a/e8021aa8bbe4cca395d4da245e1b495de2ead69d60680467d96315e13a8d/fastchunking-0.0.1.zip" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "81dfabba4bedaff9f257273ca0ea2e2d", "sha256": "a1719609cca099e7d5518481e2d194f19a5ae075866a4f8ce12b35a5c8860219" }, "downloads": -1, "filename": "fastchunking-0.0.2.zip", "has_sig": false, "md5_digest": "81dfabba4bedaff9f257273ca0ea2e2d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30555, "upload_time": "2016-06-10T22:57:36", "url": "https://files.pythonhosted.org/packages/5c/d3/9d49991625c91377a148f6cad08e61e1e2cc195cbf378e14c8cde5db6536/fastchunking-0.0.2.zip" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "b99349db3b4c78bf095c904c17ce1bc7", "sha256": "c23cae643f09176145cc1628d987e7dfcb9d4bae3e1a80d7b6f0fae83d0bcaa2" }, "downloads": -1, "filename": "fastchunking-0.0.3.tar.gz", "has_sig": false, "md5_digest": "b99349db3b4c78bf095c904c17ce1bc7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24938, "upload_time": "2017-02-14T15:40:28", "url": "https://files.pythonhosted.org/packages/64/89/f9ede0f8c27c34ca7794beb3eede54aa630e7a1be9449aede50100db10ab/fastchunking-0.0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b99349db3b4c78bf095c904c17ce1bc7", "sha256": "c23cae643f09176145cc1628d987e7dfcb9d4bae3e1a80d7b6f0fae83d0bcaa2" }, "downloads": -1, "filename": "fastchunking-0.0.3.tar.gz", "has_sig": false, "md5_digest": "b99349db3b4c78bf095c904c17ce1bc7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24938, "upload_time": "2017-02-14T15:40:28", "url": "https://files.pythonhosted.org/packages/64/89/f9ede0f8c27c34ca7794beb3eede54aa630e7a1be9449aede50100db10ab/fastchunking-0.0.3.tar.gz" } ] }