{ "info": { "author": "Konstantinos Siaterlis", "author_email": "siaterliskonsta@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# S3 Parquetifier\n[![Build Status](https://semaphoreci.com/api/v1/projects/e5f4d811-2000-4e01-a0e5-eb695ebc92d6/2651638/shields_badge.svg)](https://semaphoreci.com/thelastdev/s3-parquetifier)\n[![PyPI version fury.io](https://badge.fury.io/py/s3-parquetifier.svg)](https://pypi.org/project/s3-parquetifier/)\n[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://lbesson.mit-license.org/)\n\nS3 Parquetifier is an ETL tool that can take a file from an S3 bucket, convert it to Parquet format, and\nsave it to another bucket.\n\nS3 Parquetifier supports the following file types:\n- [x] CSV\n- [ ] JSON\n- [ ] TSV\n\n## Instructions\n\n### How to install\n\nTo install the package, just run the following:\n\n```bash\npip install s3-parquetifier\n```\n\n### How to use it\n\nS3 Parquetifier needs an AWS account with at least read rights for the source bucket\nand read-write rights for the target bucket. 
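Before running the converter, you can sanity-check that your credentials can reach both buckets. The helper below is a sketch (not part of s3-parquetifier) built on boto3's `head_bucket` call; the bucket names are placeholders, and the optional `client` parameter exists only so the check can be exercised without real AWS credentials:\n\n```python\ndef can_reach_bucket(bucket, client=None):\n    # Hypothetical helper: returns True when the current credentials\n    # can see the given bucket, False otherwise.\n    if client is None:\n        import boto3  # imported lazily so the sketch stays self-contained\n        client = boto3.client('s3')\n    try:\n        client.head_bucket(Bucket=bucket)\n        return True\n    except Exception:  # boto3 raises ClientError on 403/404\n        return False\n\n\n# Example (placeholder names):\n# can_reach_bucket('my-source-bucket')\n# can_reach_bucket('my-target-bucket')\n```\n\nNote that `head_bucket` only confirms the bucket is visible to your credentials; it does not prove write access to the target bucket. 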
\n\nYou can read about how to set up S3 roles and policies [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_s3_rw-bucket.html).\n\n#### Running the Script\n\n```python\nfrom s3_parquetifier import S3Parquetifier\n\n# Call the converter\nS3Parquetifier(\n    aws_access_key='',\n    aws_secret_key='',\n    region='',\n    verbose=True,  # Enable verbose output\n    source_bucket='',\n    target_bucket='',\n    type='S3'  # For now only S3\n).convert_from_s3(\n    source_key='',\n    target_key='',\n    file_type='csv',  # For now only CSV\n    chunk_size=100000,  # The number of rows per Parquet file\n    dtype=None,  # A dictionary defining the types of the columns\n    skip_rows=None,  # How many rows to skip per chunk\n    compression='gzip',  # The compression type\n    keep_original_name_locally=True,  # Keep the original filename, or generate a random one, when downloading the file\n    encoding='utf-8'  # The encoding of the file\n)\n```\n\n### Adding a custom pre-processing function\n\nYou can apply a custom pre-processing function to your source file. Because this tool is designed for large files, pre-processing\nis applied to each chunk separately. If the full file is needed for pre-processing, pre-process the source file locally first.\n\nIn the following example, we add the columns `test1`, `test2`, and `test3` to each chunk, with the values `1`, `2`, and `3` respectively.\n\nWe define our function, named `pre_process`, below, and we define its arguments in `kwargs`.\nThe chunk DataFrame does not need to be included in `kwargs`; it is passed in automatically. 
You have to pass your function via the\n`pre_process_chunk` argument and its arguments via `kwargs` in the `convert_from_s3` method.\n\n```python\nfrom s3_parquetifier import S3Parquetifier\n\n\n# Add three new columns with custom values\ndef pre_process(chunk, columns=None, values=None):\n    for index, column in enumerate(columns):\n        chunk[column] = values[index]\n\n    return chunk\n\n# Define the arguments for the pre-processor\nkwargs = {\n    'columns': ['test1', 'test2', 'test3'],\n    'values': [1, 2, 3]\n}\n\n# Call the converter\nS3Parquetifier(\n    aws_access_key='',\n    aws_secret_key='',\n    region='',\n    verbose=True,  # Enable verbose output\n    source_bucket='',\n    target_bucket='',\n    type='S3'  # For now only S3\n).convert_from_s3(\n    source_key='',\n    target_key='',\n    file_type='csv',  # For now only CSV\n    chunk_size=100000,  # The number of rows per Parquet file\n    dtype=None,  # A dictionary defining the types of the columns\n    skip_rows=None,  # How many rows to skip per chunk\n    compression='gzip',  # The compression type\n    keep_original_name_locally=True,  # Keep the original filename, or generate a random one, when downloading the file\n    encoding='utf-8',  # The encoding of the file\n    pre_process_chunk=pre_process,  # A pre-processing function applied to each chunk\n    kwargs=kwargs  # Extra arguments for the pre-processing function\n)\n```\n\n## ToDo\n\n- [x] Add support to handle local files too\n- [ ] Add support for JSON\n- [ ] Add support for streaming from a URL", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Orfium/s3-parquetifier", "keywords": "", "license": "LICENSE", "maintainer": "", "maintainer_email": "", "name": "s3-parquetifier", "package_url": "https://pypi.org/project/s3-parquetifier/", "platform": "", "project_url": "https://pypi.org/project/s3-parquetifier/", "project_urls": { "Homepage": 
"https://github.com/Orfium/s3-parquetifier" }, "release_url": "https://pypi.org/project/s3-parquetifier/0.1.1/", "requires_dist": null, "requires_python": "", "summary": "ETL job from CSV to Parquet in AWS S3", "version": "0.1.1" }, "last_serial": 5545225, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "005b1b848ee8beafbfbdea1488e5e18b", "sha256": "0de91a509218b981d242549b358b15eaee967983ac1d3d49080df7214fc0060b" }, "downloads": -1, "filename": "s3-parquetifier-0.0.1.tar.gz", "has_sig": false, "md5_digest": "005b1b848ee8beafbfbdea1488e5e18b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8568, "upload_time": "2019-04-22T12:23:01", "url": "https://files.pythonhosted.org/packages/37/46/cf59566b1c1a99dcc211873c48517337cde6607dc56382295dc561a253fa/s3-parquetifier-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "3152323dcc9e6071c24f87cda15806c7", "sha256": "5c0d8fd6de94ad9e69d6ce8486bf939e5be87df18bb0d5c58f58d85bd2fa3266" }, "downloads": -1, "filename": "s3_parquetifier-0.0.2-py2-none-any.whl", "has_sig": false, "md5_digest": "3152323dcc9e6071c24f87cda15806c7", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 12476, "upload_time": "2019-04-30T16:40:26", "url": "https://files.pythonhosted.org/packages/d8/59/39cf2407058ea724f7ae6cac782f96852e0803cb0cae560263f85ae6268e/s3_parquetifier-0.0.2-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "165410c918f74b9a1e56a82eb3a6c9a1", "sha256": "5354e4215b590fccc5a6d74e200a0600e0f9bcfa05f4d7c804b11972335ac2d7" }, "downloads": -1, "filename": "s3-parquetifier-0.0.2.tar.gz", "has_sig": false, "md5_digest": "165410c918f74b9a1e56a82eb3a6c9a1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8615, "upload_time": "2019-04-30T16:50:01", "url": 
"https://files.pythonhosted.org/packages/97/8f/8a60479e9ccf6b39e0a649bc7958bff1d20c9e50b4a12f43e65901bee3c8/s3-parquetifier-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "1c0c2d1f01de1e25155e7f5779c79fd4", "sha256": "5b84ae0948ed08e8e505fe299ae6726eafa54fd663eb1ff19ca423bed244ff95" }, "downloads": -1, "filename": "s3-parquetifier-0.0.3.tar.gz", "has_sig": false, "md5_digest": "1c0c2d1f01de1e25155e7f5779c79fd4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8612, "upload_time": "2019-04-30T16:51:07", "url": "https://files.pythonhosted.org/packages/77/8c/4cd9e1b5b915947e3dd3d03b42ffeb68cc9f2e4137651e1d80482c717499/s3-parquetifier-0.0.3.tar.gz" } ], "0.0.6": [ { "comment_text": "", "digests": { "md5": "52e9103b0300651b8dc82422e3182174", "sha256": "6efc0fe4df58800106e1dbaed5027f1bbf9a4c655dd796ed6e162ae323330a47" }, "downloads": -1, "filename": "s3-parquetifier-0.0.6.tar.gz", "has_sig": false, "md5_digest": "52e9103b0300651b8dc82422e3182174", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9721, "upload_time": "2019-05-30T11:51:16", "url": "https://files.pythonhosted.org/packages/b2/04/abc904b4c9a7fa03da13ca5547562a4107e45ca300a631f326df40f54558/s3-parquetifier-0.0.6.tar.gz" } ], "0.0.7": [ { "comment_text": "", "digests": { "md5": "7e8d69de9fb5b6baaaad36949c66b8df", "sha256": "3839beada5b16b47d8e0d4f5e0acc6e0d6cc4971e97349f4d9d8ee75d080bf79" }, "downloads": -1, "filename": "s3-parquetifier-0.0.7.tar.gz", "has_sig": false, "md5_digest": "7e8d69de9fb5b6baaaad36949c66b8df", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10013, "upload_time": "2019-06-03T10:52:53", "url": "https://files.pythonhosted.org/packages/75/c7/29a518ea1d3290b838640739f246df3c9f1bfcde678e2729fd78493da76d/s3-parquetifier-0.0.7.tar.gz" } ], "0.0.8": [ { "comment_text": "", "digests": { "md5": "8d39cd55efcdb29fe1acd48723396050", "sha256": 
"600b8596072a8f4c56d85b3ff1992640062691ab8b8cfaee99528e12dbc6a23f" }, "downloads": -1, "filename": "s3-parquetifier-0.0.8.tar.gz", "has_sig": false, "md5_digest": "8d39cd55efcdb29fe1acd48723396050", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10123, "upload_time": "2019-06-11T08:44:15", "url": "https://files.pythonhosted.org/packages/7d/05/e23ea444d99de22aa6f6f55bb947e0bd42ce2a00586e0feb3d3423ecf036/s3-parquetifier-0.0.8.tar.gz" } ], "0.0.9": [ { "comment_text": "", "digests": { "md5": "137b38750cf803e04d215e43336f19ed", "sha256": "48daf2098ca653e6d69fc544957befb216eebb2cfdec1bfe00ddafcefc4a64f2" }, "downloads": -1, "filename": "s3-parquetifier-0.0.9.tar.gz", "has_sig": false, "md5_digest": "137b38750cf803e04d215e43336f19ed", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10103, "upload_time": "2019-06-11T10:40:09", "url": "https://files.pythonhosted.org/packages/fe/2c/5c78dfe51751ba7687049be61cd363ca44d0b65afc8bf9d6ef9db219cb77/s3-parquetifier-0.0.9.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "c766b810fdf14d3c9656edcb6e4a8527", "sha256": "0f0c5dfedd595fe1c5c68ccb0a64ae8010146789086ab6ff4dfd67575953c078" }, "downloads": -1, "filename": "s3-parquetifier-0.1.0.tar.gz", "has_sig": false, "md5_digest": "c766b810fdf14d3c9656edcb6e4a8527", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10099, "upload_time": "2019-06-11T10:54:09", "url": "https://files.pythonhosted.org/packages/67/7c/d7cae0ebcaa3d2902275390f46b52fea6e59ee27dfcd2f7093e8b6b85a69/s3-parquetifier-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "92b33b00300240a4638114c60f87d743", "sha256": "d2f74c00e149dd1156c3c096561ab7be03b88346ebc682c36f59387a098150b2" }, "downloads": -1, "filename": "s3-parquetifier-0.1.1.tar.gz", "has_sig": false, "md5_digest": "92b33b00300240a4638114c60f87d743", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 9804, "upload_time": "2019-07-17T10:49:04", "url": "https://files.pythonhosted.org/packages/18/c1/e0fadf5256fb2ef4deea7c69b1ba8eea53096d8093b956580658ffb5111d/s3-parquetifier-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "92b33b00300240a4638114c60f87d743", "sha256": "d2f74c00e149dd1156c3c096561ab7be03b88346ebc682c36f59387a098150b2" }, "downloads": -1, "filename": "s3-parquetifier-0.1.1.tar.gz", "has_sig": false, "md5_digest": "92b33b00300240a4638114c60f87d743", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9804, "upload_time": "2019-07-17T10:49:04", "url": "https://files.pythonhosted.org/packages/18/c1/e0fadf5256fb2ef4deea7c69b1ba8eea53096d8093b956580658ffb5111d/s3-parquetifier-0.1.1.tar.gz" } ] }