{
    "info": {
        "author": "Dominik Leibenger",
        "author_email": "python-fastchunking@mails.dominik-leibenger.de",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 2 - Pre-Alpha",
            "Intended Audience :: Developers",
            "License :: OSI Approved :: Apache Software License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3",
            "Programming Language :: Python :: 3.5",
            "Topic :: Software Development :: Libraries :: Python Modules"
        ],
        "description": "===========================\nfastchunking Python library\n===========================\n\n.. image:: https://travis-ci.org/netleibi/fastchunking.svg?branch=master\n    :target: https://travis-ci.org/netleibi/fastchunking\n\n.. image:: https://badge.fury.io/py/fastchunking.svg\n    :target: https://badge.fury.io/py/fastchunking\n\n.. image:: https://readthedocs.org/projects/fastchunking/badge/?version=latest\n    :target: http://fastchunking.readthedocs.io/en/latest/?badge=latest\n    :alt: Documentation Status\n\nWhat it is\n----------\n\n`fastchunking` is a Python library that contains efficient and easy-to-use\nimplementations of string chunking algorithms.\n\nIt has been developed as part of the work [LS16]_ at CISPA, Saarland University.\n\nInstallation\n------------\n\n::\n\n    $ pip install fastchunking\n\n.. note:: For performance reasons, parts of this library are implemented in C++.\n\tInstallation from a source distribution, thus, requires availability of a\n\tcorrectly configured C++ compiler.\n\nUsage and Overview\n------------------\n\n`fastchunking` provides efficient implementations for different string chunking\nalgorithms, e.g., static chunking (SC) and content-defined chunking (CDC).\n\nStatic Chunking (SC)\n^^^^^^^^^^^^^^^^^^^^\n\nStatic chunking splits a message into fixed-size chunks.\n\nLet us consider a random example message that shall be chunked:\n    >>> import os\n    >>> message = os.urandom(1024*1024)\n\nStatic chunking is trivial when chunking a single message:\n    >>> import fastchunking\n    >>> sc = fastchunking.SC()\n    >>> chunker = sc.create_chunker(chunk_size=4096)\n    >>> chunker.next_chunk_boundaries(message)\n    [4096, 8192, 12288, ...]\n\nA large message can also be chunked in fragments, though:\n    >>> chunker = sc.create_chunker(chunk_size=4096)\n    >>> chunker.next_chunk_boundaries(message[:10240])\n    [4096, 8192]\n    >>> chunker.next_chunk_boundaries(message[10240:])\n    [2048, 6144, 10240, ...]\n\nContent-Defined Chunking (CDC)\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n`fastchunking` supports content-defined chunking, i.e., chunking of messages\ninto fragments of variable lengths.\n\nCurrently, a chunking strategy based on Rabin-Karp rolling hashes is supported.\n\nAs a rolling hash computation on plain-Python strings is incredibly slow with\nany interpreter, most of the computation is performed by a C++ extension which\nis based on the `ngramhashing` library by Daniel Lemire, see:\nhttps://github.com/lemire/rollinghashcpp\n\nLet us consider a random message that should be chunked:\n    >>> import os\n    >>> message = os.urandom(1024*1024)\n\nWhen using static chunking, we have to specify a rolling hash window size (here:\n48 bytes) and an optional seed value that affects the pseudo-random distribution\nof the generated chunk boundaries.\n\nDespite that, usage is similar to static chunking:\n    >>> import fastchunking\n    >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)\n    >>> chunker = cdc.create_chunker(chunk_size=4096)\n    >>> chunker.next_chunk_boundaries(message)\n    [7475, 10451, 12253, 13880, 15329, 19808, ...]\n    \nChunking in fragments is straightforward:\n    >>> chunker = cdc.create_chunker(chunk_size=4096)\n    >>> chunker.next_chunk_boundaries(message[:10240])\n    [7475]\n    >>> chunker.next_chunk_boundaries(message[10240:])\n    [211, 2013, 3640, 5089, 9568, ...]\n\nMulti-Level Chunking (ML-\\*)\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nMultiple chunkers of the same type (but with different chunk sizes) can be\nefficiently used in parallel, e.g., to perform multi-level chunking [LS16]_.\n\nAgain, let us consider a random message that should be chunked:\n    >>> import os\n    >>> message = os.urandom(1024*1024)\n\nUsage of multi-level-chunking, e.g., ML-CDC, is easy:\n    >>> import fastchunking\n    >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)\n    >>> chunk_sizes = [1024, 2048, 4096]\n    >>> chunker = cdc.create_multilevel_chunker(chunk_sizes)\n    >>> chunker.next_chunk_boundaries_with_levels(message)\n    [(1049, 2), (1511, 1), (1893, 2), (2880, 1), (2886, 0),\n    (3701, 0), (4617, 0), (5809, 2), (5843, 0), ...]\n\nThe second value in each tuple indicates the highest chunk size that leads to\na boundary. Here, the first boundary is a boundary created by the chunker with\nindex 2, i.e., the chunker with 4096 bytes target chunk size.\n\n.. note::\n   Only the highest index is output if multiple chunkers yield the same\n   boundary.\n    \n.. warning::\n   Chunk sizes have to be passed in correct order, i.e., from lowest to highest\n   value.\n\nPerformance\n-----------\n\nComputation costs for `static chunking` are barely measurable: As chunking does\nnot depend on the actual message but only its length, computation costs are\nessentially limited to a single :code:`xrange` call.\n\n`Content-defined chunking`, however, is expensive: The algorithm has to compute\nhash values for rolling hash window contents at `every` byte position of the\nmessage that is to be chunked. To minimize costs, fastchunking works as follows:\n    \n    1. The message (fragment) is passed in its entirety to the C++ extension.\n    2. Chunking is performed within the C++ extension.\n    3. The resulting list of chunk boundaries is communicated back to Python and\n       converted into a Python list.\n\nBased on a 100 MiB random content, the author measured the following throughput\non an Intel Core i7-4770K in a single, non-representative test run using\nPython 3.5 (Windows x86-64):\n\n    =========== ==========\n    chunk size  throughput\n    =========== ==========\n    64 bytes    118 MiB/s\n    128 bytes   153 MiB/s\n    256 bytes   187 MiB/s\n    512 bytes   206 MiB/s\n    1024 bytes  221 MiB/s\n    2048 bytes  226 MiB/s\n    4096 bytes  231 MiB/s\n    8192 bytes  234 MiB/s\n    16384 bytes 233 MiB/s\n    32768 bytes 234 MiB/s\n    =========== ==========\n\nTesting\n-------\n\n`fastchunking` uses tox for testing, so simply run:\n\n::\n\n\t$ tox\n\nReferences:\n    .. [LS16] Dominik Leibenger and Christoph Sorge (2016). sec-cs: Getting the\n       Most out of Untrusted Cloud Storage.\n       `arXiv:1606.03368 <http://arxiv.org/abs/1606.03368>`_",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/netleibi/fastchunking",
        "keywords": "text chunking,SC,static chunking,CDC,content-defined chunking,ML-*,multi-level chunking,ML-SC,ML-CDC,Rabin Karp,rolling hash",
        "license": "Apache Software License",
        "maintainer": "",
        "maintainer_email": "",
        "name": "fastchunking",
        "package_url": "https://pypi.org/project/fastchunking/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/fastchunking/",
        "project_urls": {
            "Homepage": "https://github.com/netleibi/fastchunking"
        },
        "release_url": "https://pypi.org/project/fastchunking/0.0.3/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "Fast chunking library.",
        "version": "0.0.3"
    },
    "last_serial": 2641300,
    "releases": {
        "0.0.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "34d0c57a3bf39e7cc0213781917462c5",
                    "sha256": "e08825537f962471c53ebd4453612e4d80c29906aaae209d5ffdb083e0abb5ec"
                },
                "downloads": -1,
                "filename": "fastchunking-0.0.1.zip",
                "has_sig": false,
                "md5_digest": "34d0c57a3bf39e7cc0213781917462c5",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 30314,
                "upload_time": "2016-06-10T16:16:54",
                "url": "https://files.pythonhosted.org/packages/5f/8a/e8021aa8bbe4cca395d4da245e1b495de2ead69d60680467d96315e13a8d/fastchunking-0.0.1.zip"
            }
        ],
        "0.0.2": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "81dfabba4bedaff9f257273ca0ea2e2d",
                    "sha256": "a1719609cca099e7d5518481e2d194f19a5ae075866a4f8ce12b35a5c8860219"
                },
                "downloads": -1,
                "filename": "fastchunking-0.0.2.zip",
                "has_sig": false,
                "md5_digest": "81dfabba4bedaff9f257273ca0ea2e2d",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 30555,
                "upload_time": "2016-06-10T22:57:36",
                "url": "https://files.pythonhosted.org/packages/5c/d3/9d49991625c91377a148f6cad08e61e1e2cc195cbf378e14c8cde5db6536/fastchunking-0.0.2.zip"
            }
        ],
        "0.0.3": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "b99349db3b4c78bf095c904c17ce1bc7",
                    "sha256": "c23cae643f09176145cc1628d987e7dfcb9d4bae3e1a80d7b6f0fae83d0bcaa2"
                },
                "downloads": -1,
                "filename": "fastchunking-0.0.3.tar.gz",
                "has_sig": false,
                "md5_digest": "b99349db3b4c78bf095c904c17ce1bc7",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 24938,
                "upload_time": "2017-02-14T15:40:28",
                "url": "https://files.pythonhosted.org/packages/64/89/f9ede0f8c27c34ca7794beb3eede54aa630e7a1be9449aede50100db10ab/fastchunking-0.0.3.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "b99349db3b4c78bf095c904c17ce1bc7",
                "sha256": "c23cae643f09176145cc1628d987e7dfcb9d4bae3e1a80d7b6f0fae83d0bcaa2"
            },
            "downloads": -1,
            "filename": "fastchunking-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "b99349db3b4c78bf095c904c17ce1bc7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 24938,
            "upload_time": "2017-02-14T15:40:28",
            "url": "https://files.pythonhosted.org/packages/64/89/f9ede0f8c27c34ca7794beb3eede54aa630e7a1be9449aede50100db10ab/fastchunking-0.0.3.tar.gz"
        }
    ]
}