{ "info": { "author": "Jurismarches", "author_email": "contact@jurismarches.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)", "Programming Language :: Python", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": "Dask indexed gzip\n##################\n\n|pypi-version| |travis| |coveralls|\n\nAn implementation compatible with the `dask read_text`_ interface,\nthat can chunk a gzipped text file into several partitions,\nthanks to an index provided by `indexed_gzip`_\n\nThis is useful when your data resides in a big gzipped file,\nyet you want to leverage dask parallelism capabilities.\n\nSample session\n---------------\n\n::\n\n >>> import os\n >>> import dask_igzip\n\n.. initialization\n\n >>> data_path = os.path.join(os.path.dirname(dask_igzip.__file__), \"..\", \"test\", \"data\")\n\n::\n\n >>> source = os.path.join(data_path, \"sample.txt.gz\")\n >>> # 3 lines per chunk (obviously this is for demoing)\n >>> bag = dask_igzip.read_text(source, chunk_size=3, encoding=\"utf-8\")\n >>> lines = bag.take(4, npartitions=2)\n >>> print(\"\".join(lines).strip())\n a first sentence\n a second sentence\n a third sentence\n a fourth sentence\n >>> bag.str.upper().str.strip().compute()[8]\n 'LINE 9'\n\nWhy ?\n-----\n\nDask `read_text` creates a single partition if you provide it with a gzip file.\nThis limitation comes from the fact that\nthere is no way to split the gzip file in a predictable yet coherent way.\n\nThis project provides an implementation where the gzip is indexed,\nthen line positions are also indexed,\nso that reading the text can be done by chunk (thus enabling parallelism).\nOn first run, indexes are saved on disk, so that subsequent runs are fast.\n\n.. 
_`indexed_gzip`: https://github.com/pauldmccarthy/indexed_gzip\n.. _`dask read_text`: https://dask.pydata.org/en/latest/bag-creation.html#db-read-text\n\n\n.. |pypi-version| image:: https://img.shields.io/pypi/v/dask-igzip.svg\n :target: https://pypi.python.org/pypi/dask-igzip\n :alt: Latest PyPI version\n.. |travis| image:: http://img.shields.io/travis/jurismarches/dask_igzip/master.svg?style=flat\n :target: https://travis-ci.org/jurismarches/dask_igzip\n.. |coveralls| image:: http://img.shields.io/coveralls/jurismarches/dask_igzip/master.svg?style=flat\n :target: https://coveralls.io/r/jurismarches/dask_igzip\n\n\n\n\nChangelog\n#########\n\nThe format is based on `Keep a Changelog`_\nand this project tries to adhere to `Semantic Versioning`_.\n\n.. _`Keep a Changelog`: http://keepachangelog.com/en/1.0.0/\n.. _`Semantic Versioning`: http://semver.org/spec/v2.0.0.html\n\n\n0.2.0 - 2018-06-20\n==================\n\nNew\n---\n\n- read_text now accepts a limit parameter to limit the total number of lines to read\n\nChanged\n-------\n\n- incompatible format for lines index\n\n0.1.0 - 2018-06-19\n==================\n\nNew\n---\n\n- initial release\n- 100% code coverage\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jurismarches/dask-igzip", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "dask-igzip", "package_url": "https://pypi.org/project/dask-igzip/", "platform": "", "project_url": "https://pypi.org/project/dask-igzip/", "project_urls": { "Homepage": "https://github.com/jurismarches/dask-igzip" }, "release_url": "https://pypi.org/project/dask-igzip/0.2.0/", "requires_dist": [ "dask[bag] (>=0.17.5)", "indexed-gzip (>=0.8.5)", "distributed (>=1.22); extra == 'tests'", "flake8 (>=3.5.0); extra == 'tests'", "pytest-cov (>=2.5.1); extra == 'tests'", "pytest (>=3.4.2); extra == 'tests'" ], "requires_python": 
"", "summary": "dask chunked read_text on gzip file", "version": "0.2.0" }, "last_serial": 3982522, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "726a2500fb5552a72800758439a2f19e", "sha256": "c45d9941474953cf0d5c001bcf4ac811c07fe4b9ababb514480cf1dd4f9b6dde" }, "downloads": -1, "filename": "dask-igzip-0.1.0.linux-x86_64.tar.gz", "has_sig": true, "md5_digest": "726a2500fb5552a72800758439a2f19e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5537, "upload_time": "2018-06-19T21:19:43", "url": "https://files.pythonhosted.org/packages/b9/90/18b2fa4a02bbdd4f6d33908596132440343606b0e321b51af4dd30aced3a/dask-igzip-0.1.0.linux-x86_64.tar.gz" }, { "comment_text": "", "digests": { "md5": "6a65d8614112a14a7adfc1e31b32218c", "sha256": "8ba4a036d69c42f9af3a33f719692fef43300cd768b6e7740a1b5e4107822108" }, "downloads": -1, "filename": "dask_igzip-0.1.0-py3-none-any.whl", "has_sig": true, "md5_digest": "6a65d8614112a14a7adfc1e31b32218c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5676, "upload_time": "2018-06-19T21:19:41", "url": "https://files.pythonhosted.org/packages/de/01/f7fd4f15de000738c300a89ecd626ffd3d159768f7fa306d9cb0b992f167/dask_igzip-0.1.0-py3-none-any.whl" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "2128d6bb82eae5d269a5e9233060fed9", "sha256": "cd3f3ddc5ed99ce77f6cefdf2b79e91463b2a41434461b24e1855566b15357df" }, "downloads": -1, "filename": "dask-igzip-0.2.0.linux-x86_64.tar.gz", "has_sig": true, "md5_digest": "2128d6bb82eae5d269a5e9233060fed9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8007, "upload_time": "2018-06-20T18:35:03", "url": "https://files.pythonhosted.org/packages/75/b1/56377248714f959b72016566e93128099166231402207e18d71851aa24f1/dask-igzip-0.2.0.linux-x86_64.tar.gz" }, { "comment_text": "", "digests": { "md5": "fd9fd28f734f49b3e4ce3832a62008a5", "sha256": 
"b313cd8ad0b13062b6b73fd3728e07649708197768c95e8ad7ecbe560a73e57a" }, "downloads": -1, "filename": "dask_igzip-0.2.0-py3-none-any.whl", "has_sig": true, "md5_digest": "fd9fd28f734f49b3e4ce3832a62008a5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7712, "upload_time": "2018-06-20T18:35:01", "url": "https://files.pythonhosted.org/packages/79/23/303e351b201218424e1dd016d0a1d11d327d09657993057458ba20814eed/dask_igzip-0.2.0-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2128d6bb82eae5d269a5e9233060fed9", "sha256": "cd3f3ddc5ed99ce77f6cefdf2b79e91463b2a41434461b24e1855566b15357df" }, "downloads": -1, "filename": "dask-igzip-0.2.0.linux-x86_64.tar.gz", "has_sig": true, "md5_digest": "2128d6bb82eae5d269a5e9233060fed9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8007, "upload_time": "2018-06-20T18:35:03", "url": "https://files.pythonhosted.org/packages/75/b1/56377248714f959b72016566e93128099166231402207e18d71851aa24f1/dask-igzip-0.2.0.linux-x86_64.tar.gz" }, { "comment_text": "", "digests": { "md5": "fd9fd28f734f49b3e4ce3832a62008a5", "sha256": "b313cd8ad0b13062b6b73fd3728e07649708197768c95e8ad7ecbe560a73e57a" }, "downloads": -1, "filename": "dask_igzip-0.2.0-py3-none-any.whl", "has_sig": true, "md5_digest": "fd9fd28f734f49b3e4ce3832a62008a5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7712, "upload_time": "2018-06-20T18:35:01", "url": "https://files.pythonhosted.org/packages/79/23/303e351b201218424e1dd016d0a1d11d327d09657993057458ba20814eed/dask_igzip-0.2.0-py3-none-any.whl" } ] }