{ "info": { "author": "Integrichain Innovation Team", "author_email": "engineering@integrichain.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.7" ], "description": "# s3parq\nParquet file management in S3 for hive-style partitioned data\n\n## What is this? \nIn many ways, parquet standards are still the wild west of data. Depending on your partitioning style, metadata store strategy, etc., you can tackle the big data beast in a multitude of different ways. \nThis is an AWS-specific solution intended to serve as an interface between Python programs and any of the multitude of tools used to access this data. s3parq is an end-to-end solution for:\n1. writing data from pandas dataframes to s3 as partitioned parquet.\n2. reading data from s3 partitioned parquet *that was created by s3parq* to pandas dataframes.\n\n*NOTE:* s3parq writes metadata into the s3 objects and reads it back to filter records _before_ any file i/o; this makes selecting datasets faster, but it also means data must have been written with s3parq to be read with s3parq. \n\n*TLDR - to read with s3parq, you need to have written with s3parq* \n\n## Basic Usage\n\nWe get data by dataset name. \n\n    import pandas as pd\n    import s3parq as parq\n\n    bucket = 'mybucket'\n    key = 'path-in-bucket/to/my/dataset'\n    dataframe = pd.DataFrame(['some_big_data'])\n\n    ## writing to s3\n    parq.publish(\n        bucket=bucket,\n        key=key,\n        dataframe=dataframe,\n        partitions=['column1', 'column2'])\n\n    ## reading from s3, getting only records with an id >= 150\n    pandas_dataframe = parq.fetch(\n        bucket=bucket,\n        key=key,\n        filter={\"partition\": 'id', \"values\": 150, \"comparison\": '>='})\n\n\n## Getting Existing Partition Values \nA lot of pre-filtering involves trimming down your dataset based on the values already in another dataset. 
To make that easier, s3parq provides a few helper functions: \n\n    partition = 'order_id'\n\n    ## max value for the order_id partition, correctly typed\n    max_val = parq.get_max_partition_value(bucket, key, partition)\n\n    ## partition values that are not in a list of order_ids\n    ## if partition values are 1-6, this returns [5,6], correctly typed\n    list_of_vals = [0,1,2,3,4]\n    new_vals = parq.get_diff_partition_values(\n        bucket, key, partition, list_of_vals)\n\n    ## list values that are not in the partition values (reverse comparison)\n    ## if partition values are 3-8, this returns [1,2], correctly typed\n    list_of_vals = [1,2,3,4]\n    missing_vals = parq.get_diff_partition_values(\n        bucket, key, partition, list_of_vals, True)\n\n    ## dataframe of values in one dataset's partition and not in another's\n    ## input -> the dataset where extra values would be; comparison -> the dataset where they might be missing\n    ## similar to get_diff_partition_values, but works at the dataset level\n    missing_data = parq.fetch_diff(\n        input_bucket,\n        input_key,\n        comparison_bucket,\n        comparison_key,\n        partition)\n\n    ## all values for a partition\n    all_vals = parq.get_all_partition_values(bucket, key, partition)\n\n## Redshift Spectrum\nDataframes published to S3 can optionally be queried in AWS Redshift Spectrum. To enable this functionality, you must have an external database configured in Redshift. See the [AWS docs](https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html) for help setting up a database in Redshift. To enable this functionality in s3parq, pass a dictionary of configurations to `publish()` via the `redshift_params` argument.\n\n`redshift_params` is a dictionary which *must* contain the following keys (all values are strings unless noted otherwise):\n- schema_name: name of the schema to add table_name to\n- table_name: name of the table to create in Redshift\n- iam_role: ARN of an IAM Role with read/write Spectrum permissions\n- region: AWS region (e.g. 
us-east-1)\n- cluster_id: name of the cluster Redshift is configured on\n- host: URL to the cluster specified in cluster_id\n- port: port to connect to Redshift (usually 5439)\n- db_name: name of the (existing) external database configured to use Redshift Spectrum\n\nIf redshift_params is present but invalid, the entire `publish()` fails.\n\n*NOTE:* Spectrum schemas do _not_ work as normal database schemas. Tables are global to a Redshift Spectrum database, so each schema belonging to `db_name` can access all tables, regardless of the schema they are created with. Instead of schemas, different table registries require different Redshift Spectrum databases.\n\n## Gotchas\n- Filters can only be applied to partitions; this is because we do not actually pull down any of the data until after the filtering has happened. This aligns with data best practices; the things you filter on regularly are the things you should partition on!\n\n- When using `get_diff_partition_values` remembering which set you want can be confusing. You can refer to these diagrams: \n![venn diagram of reverse value](./assets/s3parq_get_diff_partition_values.png)\n![table of difference values](./assets/s3parq_diff_table.png)\n\n## Contribution\nWe welcome pull requests!\nSome basic guidelines:\n- *test yo' code.* code coverage is important! 
\n- *be respectful.* in pr comments, code comments etc;\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/IntegriChain1/s3parq", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "s3parq", "package_url": "https://pypi.org/project/s3parq/", "platform": "", "project_url": "https://pypi.org/project/s3parq/", "project_urls": { "Homepage": "https://github.com/IntegriChain1/s3parq" }, "release_url": "https://pypi.org/project/s3parq/2.1.4/", "requires_dist": [ "pandas (==0.24.2)", "pyarrow (==0.13.0)", "boto3 (==1.9.177)", "s3fs (==0.2.1)", "dfmock (==0.0.14)", "moto (==1.3.8)", "psycopg2 (==2.8.3)", "SQLAlchemy (==1.3.5)", "pytest (==5.0.0)" ], "requires_python": "", "summary": "Write and read/query s3 parquet data using Athena/Spectrum/Hive style partitioning.", "version": "2.1.4" }, "last_serial": 5858121, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "cea65c6557b8a3c97924b5649c4d67ff", "sha256": "0b9af2159f821267cba1730de7ade36150137ba932798bcdfac420a258194772" }, "downloads": -1, "filename": "s3parq-0.0.1-py3.7.egg", "has_sig": false, "md5_digest": "cea65c6557b8a3c97924b5649c4d67ff", "packagetype": "bdist_egg", "python_version": "3.7", "requires_python": null, "size": 46108, "upload_time": "2019-03-20T00:34:39", "url": "https://files.pythonhosted.org/packages/67/9e/8d6458a844d366f4db9890a26addef068715a1e5959d36b99596a7446f16/s3parq-0.0.1-py3.7.egg" }, { "comment_text": "", "digests": { "md5": "fab0ca1684bf76a264432f151acf07b1", "sha256": "1d7b8a424c460a8b641b095786174393076a7c9048ab10fbc838e7a32c81d39a" }, "downloads": -1, "filename": "s3parq-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "fab0ca1684bf76a264432f151acf07b1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21497, "upload_time": "2019-03-20T00:34:35", "url": 
"https://files.pythonhosted.org/packages/74/bb/fa441c9e33a97eb8a45cead03a9f354e20c39755b8e6dd87cd340cbc5ea2/s3parq-0.0.1-py3-none-any.whl" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "99c53fe4bf72a675ce483848b2faa2e7", "sha256": "6c2efcc63732bb1105c630a221f02ca1d77d097d9d3199e7733dd80f9e798ec7" }, "downloads": -1, "filename": "s3parq-0.0.2-py3.7.egg", "has_sig": false, "md5_digest": "99c53fe4bf72a675ce483848b2faa2e7", "packagetype": "bdist_egg", "python_version": "3.7", "requires_python": null, "size": 23865, "upload_time": "2019-03-25T20:42:40", "url": "https://files.pythonhosted.org/packages/6b/df/00d7927bd728d1c4da40fc9eec1b1414754e703e2dcb5f5241dbc95db424/s3parq-0.0.2-py3.7.egg" }, { "comment_text": "", "digests": { "md5": "b0b3fffe41501142087141581d16a460", "sha256": "76f20d569a520c0d1a83df8bd19a21f4a8639f62cd3005ffc03dad5002cfc9b1" }, "downloads": -1, "filename": "s3parq-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "b0b3fffe41501142087141581d16a460", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 11646, "upload_time": "2019-03-25T20:42:39", "url": "https://files.pythonhosted.org/packages/eb/d9/854e159738ba2ab3b324c55bdc891bfdc2abb7187789e9475d128f3b84d8/s3parq-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1a4624469fe97fbb555eaf487c0993f9", "sha256": "c542998d6dba6e43c34219d499ff34c185f510db7de186b4a88e003b13c20b09" }, "downloads": -1, "filename": "s3parq-0.0.2.tar.gz", "has_sig": false, "md5_digest": "1a4624469fe97fbb555eaf487c0993f9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11092, "upload_time": "2019-03-25T20:42:41", "url": "https://files.pythonhosted.org/packages/6e/ca/6158e1990177bd9e9c54036c7b7781ddc866d5a1b828813fb255c1c45afb/s3parq-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "43650de03f542dcea92a8fa107319e31", "sha256": "ce47fbb25260dec93a5d0fba5c21696c4d3c912fc130cab242e81c616305bac4" }, 
"downloads": -1, "filename": "s3parq-0.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "43650de03f542dcea92a8fa107319e31", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13206, "upload_time": "2019-04-23T13:28:11", "url": "https://files.pythonhosted.org/packages/b7/7b/bdfdef6988d513c30f0dc45562acec594c2d0f6adda27a5fbd5d226c01d2/s3parq-0.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c793783a2197a24505071f700c60196a", "sha256": "ca0569b0fddf1fb8f0f267579b2ec848af5090eb13ee57962f2311c9220c0eb0" }, "downloads": -1, "filename": "s3parq-0.0.3.tar.gz", "has_sig": false, "md5_digest": "c793783a2197a24505071f700c60196a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12965, "upload_time": "2019-04-23T13:28:13", "url": "https://files.pythonhosted.org/packages/14/7e/751a20dd6643d11bbe053d74d5cb9ed0138bcf4a2d5100202607555daee0/s3parq-0.0.3.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "27368596ce64bafa05270c6843fe8926", "sha256": "8df2daa73e94ad11b7d943fa994ebd2423ff597af2ac7c8668aec35d4b3455cc" }, "downloads": -1, "filename": "s3parq-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "27368596ce64bafa05270c6843fe8926", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23530, "upload_time": "2019-04-27T01:35:16", "url": "https://files.pythonhosted.org/packages/d3/1d/06f44a1d44867c6f84ccd615d04a8589e5f7b63d65a6555b3ec4e3dc5a8f/s3parq-1.0.0-py3-none-any.whl" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "49cb1f32e278d796825d5e2b479ddfc2", "sha256": "8db775867b29510501ef2073d69380a095dc5404617adf8f5fb50dacbbf1d9ac" }, "downloads": -1, "filename": "s3parq-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "49cb1f32e278d796825d5e2b479ddfc2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23528, "upload_time": "2019-05-10T20:36:05", "url": 
"https://files.pythonhosted.org/packages/f7/a7/5e209370bc6c25811a190e8f751de2a5029afca59f686271ec571ca955da/s3parq-1.0.1-py3-none-any.whl" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "20ee76d6612cae6cbf5c0a22760e88d6", "sha256": "3b02ee43e62d63cd5f061dd20c6a3a7183d83d8191bd6c3ab7c98d0095a0b79c" }, "downloads": -1, "filename": "s3parq-1.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "20ee76d6612cae6cbf5c0a22760e88d6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13853, "upload_time": "2019-05-21T13:40:06", "url": "https://files.pythonhosted.org/packages/76/69/60e71462d61d320f72417177539aafac9391b1e2a4bc43d7a9d654a44ff8/s3parq-1.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a45ccab87bd19cfa0d473cbec9a4337b", "sha256": "b0f6222e03d7e271e6fd4d173f0fae62270d1ce1bca6deb20fd554de35036e99" }, "downloads": -1, "filename": "s3parq-1.0.2.tar.gz", "has_sig": false, "md5_digest": "a45ccab87bd19cfa0d473cbec9a4337b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13410, "upload_time": "2019-05-21T13:40:07", "url": "https://files.pythonhosted.org/packages/d7/99/2fe03d94bf08b21db43d0f9dc55406d1e0bf4455b6a1d0f6ca8bcc27d184/s3parq-1.0.2.tar.gz" } ], "2.0.0": [ { "comment_text": "", "digests": { "md5": "97504e8e26e329f8416410e043fb3aa5", "sha256": "7b8c24fbb6fd6b5ade446768d85c6dc48ae7abb300b177fae2bccbdd7ee6c039" }, "downloads": -1, "filename": "s3parq-2.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "97504e8e26e329f8416410e043fb3aa5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13227, "upload_time": "2019-06-17T14:51:20", "url": "https://files.pythonhosted.org/packages/21/ae/8549bcae64e14e19ee27395d8f97b1276c656d2eac93d208d89dd0a01dc1/s3parq-2.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "773a0b30735aa0557a35d057d803c529", "sha256": 
"5fc8440d4316245777ec4b5593663077192bafdf390ac4c1582141419acc16fe" }, "downloads": -1, "filename": "s3parq-2.0.0.tar.gz", "has_sig": false, "md5_digest": "773a0b30735aa0557a35d057d803c529", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13031, "upload_time": "2019-06-17T14:51:22", "url": "https://files.pythonhosted.org/packages/f0/0a/cc5a492286f82a4f98583f367c26c1a8b072b9e4b671eabb2cab77320d1e/s3parq-2.0.0.tar.gz" } ], "2.1.0": [ { "comment_text": "", "digests": { "md5": "faa14de09fb2cacea8280b0ae4081870", "sha256": "b4f4736b5667f6b37191529e16b3f4abb843ea6154ce4a61dec690d0174f0779" }, "downloads": -1, "filename": "s3parq-2.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "faa14de09fb2cacea8280b0ae4081870", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 20130, "upload_time": "2019-07-18T20:13:33", "url": "https://files.pythonhosted.org/packages/bb/13/120f441b2951bc7061311d8ec98f5ac6b119def19053aa62a7d2716ea6c5/s3parq-2.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1255696fafbeccb4f4f867cec681b154", "sha256": "a885dfc53c3d5cd4789f4cb146a71e3d6506a94ec38882867faff5ad9ca3e88d" }, "downloads": -1, "filename": "s3parq-2.1.0.tar.gz", "has_sig": false, "md5_digest": "1255696fafbeccb4f4f867cec681b154", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19282, "upload_time": "2019-07-18T20:13:35", "url": "https://files.pythonhosted.org/packages/c3/06/517ada76f9748253b8b58b0bc8d745152db5a5cff7e726a2a83fd2d17ddd/s3parq-2.1.0.tar.gz" } ], "2.1.2": [ { "comment_text": "", "digests": { "md5": "2169d1f928a6fc2e06dbce1eec90ab51", "sha256": "dfa3ac6d0df8dcca3350bf011a8f5c3063144d5f4a024622ca5965d26638e293" }, "downloads": -1, "filename": "s3parq-2.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "2169d1f928a6fc2e06dbce1eec90ab51", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 20934, "upload_time": 
"2019-07-24T19:50:29", "url": "https://files.pythonhosted.org/packages/4d/bc/94d6f7e5b4f2844da5680e9768d0978c7a06287b414e3415b367528cd111/s3parq-2.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "bc941e25b48ad8153ff7b32fa15ecbcd", "sha256": "ed3353ebe311ebf44bd16b6795964bbf8d344d5be4c9e84acdc04cf436bc9fb4" }, "downloads": -1, "filename": "s3parq-2.1.2.tar.gz", "has_sig": false, "md5_digest": "bc941e25b48ad8153ff7b32fa15ecbcd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19957, "upload_time": "2019-07-24T19:50:33", "url": "https://files.pythonhosted.org/packages/5a/d3/ce3b3810d3e75443576cce1d49541f8c0ec70f2e5a10c7d4ad756189f33c/s3parq-2.1.2.tar.gz" } ], "2.1.3": [ { "comment_text": "", "digests": { "md5": "61dd4c7ce9402eed9dba75f9a0159795", "sha256": "1f726b36f219e24665616d1ce167c66ecf43cb8ad2cdff678d09c92baf565d0b" }, "downloads": -1, "filename": "s3parq-2.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "61dd4c7ce9402eed9dba75f9a0159795", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 22003, "upload_time": "2019-08-02T14:56:42", "url": "https://files.pythonhosted.org/packages/b3/03/01faee05c71f4b8472b0bb4478933bc3e254a19b338d9b877cbfa478b5f8/s3parq-2.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e0be62151c306e7820e60bbfd7530f0f", "sha256": "e611288ea86913c55bc51ec84a6a37c69cb13744e4b9b129449c99c1184887e9" }, "downloads": -1, "filename": "s3parq-2.1.3.tar.gz", "has_sig": false, "md5_digest": "e0be62151c306e7820e60bbfd7530f0f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20017, "upload_time": "2019-08-02T14:56:43", "url": "https://files.pythonhosted.org/packages/12/14/0b8d9f7d983774d4f19fbc4ad41688a06a050608027b0d0f96472cf0477e/s3parq-2.1.3.tar.gz" } ], "2.1.4": [ { "comment_text": "", "digests": { "md5": "4ebc2f3746ca6bbc91ed9b8df056a996", "sha256": 
"190a24b66811f326a840e7145bda64a3db58e14b7c48c044912ef579eae6d2f5" }, "downloads": -1, "filename": "s3parq-2.1.4-py3-none-any.whl", "has_sig": false, "md5_digest": "4ebc2f3746ca6bbc91ed9b8df056a996", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 20635, "upload_time": "2019-09-17T19:14:57", "url": "https://files.pythonhosted.org/packages/fb/1f/742202ec9140fb82861dfb9073c143c55c72a2178e426655fb4f80f2925a/s3parq-2.1.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "57061ffaef71aa0fdb8aaf3921e4eb6c", "sha256": "38c8c70cf2e9c0c5c9e3e7ef57cca119c5b21132c0de3488eb283aa6cc8abeaa" }, "downloads": -1, "filename": "s3parq-2.1.4.tar.gz", "has_sig": false, "md5_digest": "57061ffaef71aa0fdb8aaf3921e4eb6c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19901, "upload_time": "2019-09-17T19:14:59", "url": "https://files.pythonhosted.org/packages/a1/27/26c3da59b1f8c353794fcf142b51a1b81200afa497b1fc55d7d25aefe11e/s3parq-2.1.4.tar.gz" } ], "2.1.4a0": [ { "comment_text": "", "digests": { "md5": "718714a49dae0e45e2c84f587f6a7d9d", "sha256": "59959a98f270e60dac3ab0baddfdb09e8152c5940481537dc6012f79044eb362" }, "downloads": -1, "filename": "s3parq-2.1.4a0-py3-none-any.whl", "has_sig": false, "md5_digest": "718714a49dae0e45e2c84f587f6a7d9d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21343, "upload_time": "2019-09-19T18:40:25", "url": "https://files.pythonhosted.org/packages/2b/82/08889c81485e1f47e3c4ec89a4124ca81dc04df795259459e45f6456118d/s3parq-2.1.4a0-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4ebc2f3746ca6bbc91ed9b8df056a996", "sha256": "190a24b66811f326a840e7145bda64a3db58e14b7c48c044912ef579eae6d2f5" }, "downloads": -1, "filename": "s3parq-2.1.4-py3-none-any.whl", "has_sig": false, "md5_digest": "4ebc2f3746ca6bbc91ed9b8df056a996", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, 
"size": 20635, "upload_time": "2019-09-17T19:14:57", "url": "https://files.pythonhosted.org/packages/fb/1f/742202ec9140fb82861dfb9073c143c55c72a2178e426655fb4f80f2925a/s3parq-2.1.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "57061ffaef71aa0fdb8aaf3921e4eb6c", "sha256": "38c8c70cf2e9c0c5c9e3e7ef57cca119c5b21132c0de3488eb283aa6cc8abeaa" }, "downloads": -1, "filename": "s3parq-2.1.4.tar.gz", "has_sig": false, "md5_digest": "57061ffaef71aa0fdb8aaf3921e4eb6c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19901, "upload_time": "2019-09-17T19:14:59", "url": "https://files.pythonhosted.org/packages/a1/27/26c3da59b1f8c353794fcf142b51a1b81200afa497b1fc55d7d25aefe11e/s3parq-2.1.4.tar.gz" } ] }