{ "info": { "author": "Andrii Stepaniuk", "author_email": "andrii.stepaniuk@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "### split-gzip-upload\n\n#### Description\nTool to split stdin, gzip it, and upload it to S3.\n\n#### Prerequisites\nThe AWS [CLI](https://aws.amazon.com/cli/) should be installed and \n[configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-quick-configuration).\n\n#### Installation\n```\npip install split-gzip-upload-tool\n```\n\n#### Usage examples\nLet's say you need to move a large amount of data from AWS Aurora Postgres to AWS Redshift.\nYou can do it with just:\n```\npsql -c '\\copy some_big_table TO PROGRAM \"aws s3 cp - s3://your-bucket/some_big_table.csv\"'\n```\nto copy all your data as a single (large) CSV file. \nOr you can compress it with gzip (or bzip2 etc.) first:\n```\npsql -c '\\copy some_big_table TO PROGRAM \"gzip | aws s3 cp - s3://your-bucket/some_big_table.csv.gz\"'\n```\nBut this gives you one compressed file instead of the N files that the Redshift documentation [recommends](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html). \nInstead, you can use the tool to split `psql`'s `\\copy` output into N parts and upload them to S3:\n```\npsql -c '\\copy some_big_table TO PROGRAM \"split-gzip-upload --bucket=your-bucket --path=some_big_table --slices=12\"'\n```\nThis uploads your gzipped data a little faster and, more importantly, splits it into N parts of nearly equal length, which suits Redshift because it can load them in parallel. 
\n\n#### Help\n```\nsplit-gzip-upload --help\nusage: split-gzip-upload [-h] --bucket BUCKET --path PATH [--slices SLICES]\n [--batch-size BATCH_SIZE]\n\noptional arguments:\n -h, --help show this help message and exit\n --bucket BUCKET S3 bucket name\n --path PATH path in S3 bucket, some/path will be converted to\n something like some/path.123.csv.gz, where 123 is a\n slice number\n --slices SLICES number of output slices\n --batch-size BATCH_SIZE\n number of lines in a batch\n```\n\n#### FAQ\n**Q:** What is it for? \n**A:** It's an all-in-one tool that splits STDIN, compresses each part with gzip, and uploads the compressed parts to S3 in parallel.\n\n**Q:** How is this different from the Linux [`split`](http://man7.org/linux/man-pages/man1/split.1.html) command with the `--filter` \noption, like [here](https://stackoverflow.com/posts/23701140/revisions)? Why reinvent the wheel? \n**A:** If `split` with `--filter` works for you, use it. But with `split` you have to specify the number of lines per file. \nThat's fine if you know how many lines your source file has: you divide the total number of lines by the number of files you want to produce. \nWith the tool you only specify the number of lines per batch (the portion of data written to a compressed file) and skip the calculation.\n\n**Q:** Which is faster: gzip plus upload with the AWS CLI, or the tool? 
\n**A:** The tool is a little bit faster:\n```\nll ~/Downloads/hmnist_28_28_RGB.csv\n-rwxr-xr-x@ 1 andrii staff 41M Sep 19 14:24 /Users/andrii/Downloads/hmnist_28_28_RGB.csv\n```\nAnd with not so fast internet connection on MBP, first using the tool:\n```\ntime cat ~/Downloads/hmnist_28_28_RGB.csv | split-gzip-upload --bucket= --path=test-sgu --slices=5 --batch-size=100\n...\ncat ~/Downloads/hmnist_28_28_RGB.csv 0.00s user 0.05s system 1% cpu 3.449 total\nsplit-gzip-upload --bucket= --path=test-sgu --slices=5 3.40s user 0.31s system 31% cpu 11.955 total\n```\nand using the aws cli:\n```\ntime cat ~/Downloads/hmnist_28_28_RGB.csv | gzip | aws s3 cp - s3:///test-sgu.csv.gz \ncat ~/Downloads/hmnist_28_28_RGB.csv 0.00s user 0.03s system 0% cpu 5.609 total\ngzip 4.95s user 0.01s system 88% cpu 5.621 total\naws s3 cp - s3:///test-sgu.csv.gz 0.58s user 0.23s system 5% cpu 15.091 total\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/andriis/split-gzip-upload-tool", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "split-gzip-upload-tool", "package_url": "https://pypi.org/project/split-gzip-upload-tool/", "platform": "", "project_url": "https://pypi.org/project/split-gzip-upload-tool/", "project_urls": { "Homepage": "https://github.com/andriis/split-gzip-upload-tool" }, "release_url": "https://pypi.org/project/split-gzip-upload-tool/0.0.1/", "requires_dist": [ "boto3", "botocore" ], "requires_python": "", "summary": "Tool to split stdin, gzip it and upload to s3", "version": "0.0.1" }, "last_serial": 4365591, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "5218bcb841a29cd269a76a296f82108b", "sha256": "7fe450c1304488ef66c756e7151f94c5dafe25ad2a67e5e823ea8230c68b85f2" }, "downloads": -1, "filename": "split_gzip_upload_tool-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": 
"5218bcb841a29cd269a76a296f82108b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7019, "upload_time": "2018-10-11T19:11:17", "url": "https://files.pythonhosted.org/packages/1d/b9/586b9ba6f75f4e194d5a421c6b88bddd6561facd488581342abe23bc6f76/split_gzip_upload_tool-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e02375a5fb59e00a72254ac24609706c", "sha256": "2cd0732559fe34c43bc5d9c46d19e6bec3a782011fd69c4b72124e158b6af0eb" }, "downloads": -1, "filename": "split-gzip-upload-tool-0.0.1.tar.gz", "has_sig": false, "md5_digest": "e02375a5fb59e00a72254ac24609706c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5065, "upload_time": "2018-10-11T19:11:19", "url": "https://files.pythonhosted.org/packages/61/a2/0cc916a941c682aa4388dfd70dcb5d7c3685d781cf140532a1282d7d2fc2/split-gzip-upload-tool-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5218bcb841a29cd269a76a296f82108b", "sha256": "7fe450c1304488ef66c756e7151f94c5dafe25ad2a67e5e823ea8230c68b85f2" }, "downloads": -1, "filename": "split_gzip_upload_tool-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "5218bcb841a29cd269a76a296f82108b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7019, "upload_time": "2018-10-11T19:11:17", "url": "https://files.pythonhosted.org/packages/1d/b9/586b9ba6f75f4e194d5a421c6b88bddd6561facd488581342abe23bc6f76/split_gzip_upload_tool-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e02375a5fb59e00a72254ac24609706c", "sha256": "2cd0732559fe34c43bc5d9c46d19e6bec3a782011fd69c4b72124e158b6af0eb" }, "downloads": -1, "filename": "split-gzip-upload-tool-0.0.1.tar.gz", "has_sig": false, "md5_digest": "e02375a5fb59e00a72254ac24609706c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5065, "upload_time": "2018-10-11T19:11:19", "url": 
"https://files.pythonhosted.org/packages/61/a2/0cc916a941c682aa4388dfd70dcb5d7c3685d781cf140532a1282d7d2fc2/split-gzip-upload-tool-0.0.1.tar.gz" } ] }