{ "info": { "author": "Journera Inc", "author_email": "opensource@journera.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: System Administrators", "License :: OSI Approved :: BSD License", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy" ], "description": "# Glutil\n\nA collection of utilities for managing partitions of tables in the AWS Glue Data Catalog that are built on datasets stored in S3.\n\n[![Build Status](https://travis-ci.org/Journera/glutil.svg?branch=master)](https://travis-ci.org/Journera/glutil)\n\n## Background\n\nAWS's Glue Data Catalog provides an index of the location and schema of your data across AWS data stores and is used to reference sources and targets for ETL jobs in AWS Glue. It is fully-integrated with AWS Athena, an ad-hoc query tool that uses the Hive metastore to build external tables on top of S3 data and PrestoDB to query the data with standard SQL.\n\nJournera heavily uses Kinesis Firehoses to write data from our platform to S3 in near real-time, Athena for ad-hoc analysis of data on S3, and Glue's serverless engine to execute PySpark ETL jobs on S3 data using the tables defined in the Data Catalog.\nUsing the Data Catalog is generally pretty great, but sometimes all of these managed services don't play well together, or a configuration mistake was made (e.g., in a table DDL).\nFor those cases, we have these utilities.\n\nOur original use case for this project was as a Glue Crawler replacement for adding new partitions to tables that don't use Hive-style partitions and for tables built on top of S3 datasets that the Glue Crawler could not successfully parse.\nFor the most part this is a workaround, because of current limitations with the Glue Crawler and Terraform, which does not support configuring Kinesis Firehoses to write JSON data to S3 using formatted 
prefixes.\n\n## Installation\n\n`glutil` can be installed using pip.\n\n``` bash\npip install glutil\n```\n\nIf you wish to manually install it, you can clone the repository and run\n\n``` bash\npython3 setup.py install\n```\n\n## Provided Utilities\n\nThere are three main ways to use these utilities: as the `glutil` library in your Python code, through the provided `glutil` command line script, or as a Lambda replacement for a Glue Crawler.\n\n## Built-In Assumptions\n\nBecause `glutil` started life as a way to work with Journera-managed data, there are still a number of assumptions built into the code.\nIdeally these will be removed in the future to enable use with more diverse sets of data.\n\n1. The tables use S3 as their backing data store.\n\n1. All partitions are stored under the table's location.\n\n For example, if you have a table with the location `s3://some-data-bucket/my-table/`, `glutil` will only find partitions located in `s3://some-data-bucket/my-table/`.\n Your table location can be as deep or shallow as you want; `glutil` will operate the same for a table located in `s3://bucket/path/to/table/it/goes/here/` and `s3://bucket/`.\n\n1. Your partition keys are `[year, month, day, hour]`.\n\n1. Your partitions are stored in one of two ways (examples assume your table's location is `s3://bucket/table/`):\n\n 1. Partitions are stored in `key=value` form or pathed similarly (examples below):\n\n ```\n s3://bucket/table/YYYY/MM/DD/HH/\n s3://bucket/table/year=YYYY/month=MM/day=DD/hour=HH/\n ```\n\n 2. 
You have a single partition key, which is your path with slashes changed into dashes (examples below):\n\n ```\n s3://bucket/table/YYYY/MM/DD/HH/ => partition value of YYYY-MM-DD-HH\n ```\n\n\n\n## IAM Permissions\n\nTo use `glutil` you need the following IAM permissions:\n\n- `glue:GetDatabase`\n- `glue:GetTable`\n- `glue:GetTables`\n- `glue:BatchCreatePartition`\n- `glue:BatchDeleteTable`\n- `glue:BatchDeletePartition`\n- `glue:GetPartitions`\n- `glue:UpdatePartition`\n- `s3:ListBucket` on the buckets containing your data\n- `s3:GetObject` on the buckets containing your data\n\nIf you're only using the `create-partition` lambda, you can get by with only:\n\n- `glue:GetDatabase`\n- `glue:GetTable`\n- `glue:BatchCreatePartition`\n- `glue:GetPartitions`\n- `s3:ListBucket`\n- `s3:GetObject`\n\n## `glutil` Command Line Interface (CLI)\n\nThe `glutil` CLI includes a number of subcommands for managing partitions and fixing the Glue Data Catalog when things go wrong.\nMost of the commands were written to fix issues caused by a Glue Crawler gone wrong, moving underlying data, or dealing with newly created data.\n\nFor the most part, they operate on the guiding principle that any action they take can be reversed (if it was an incorrect action) by running the `glutil create-partitions` command.\n\nAll commands support the `--dry-run` flag, which will output the command's expected result without modifying the Glue Data Catalog.\n\nBelow are short descriptions of the available commands.\nFor fuller descriptions and command line arguments, run `glutil --help`.\n\n### `glutil create-partitions`\n\n`create-partitions` is the original use case for this code.\nRunning it will search S3 for partitioned data, and will create new partitions for data missing from the Glue Data Catalog.\n\n### `glutil delete-all-partitions`\n\n`delete-all-partitions` will query the Glue Data Catalog and delete all partitions attached\nto the specified table.\nFor the most part it is substantially 
faster to just delete the entire table and recreate it because of AWS batch limits, but sometimes it's harder to recreate than to remove all partitions.\n\n### `glutil delete-bad-partitions`\n\n`delete-bad-partitions` will remove partitions that meet the following criteria from the catalog:\n\n- Partitions without any data in S3\n- Partitions with values that do not match their S3 location (e.g., a partition with values `[2019 01 02 03]` and a location other than `s3://table/path/2019/01/02/03/`)\n\nIn general, if you run `glutil create-partitions` twice and see it attempt to create the same partition both times, you should run `delete-bad-partitions` and try `create-partitions` again.\n\n### `glutil delete-missing-partitions`\n\n`delete-missing-partitions` will remove any partition in the Glue Data Catalog without data in S3.\n\n### `glutil update-partitions`\n\n`update-partitions` should be run after moving your data in S3 and updating your table's location in the catalog.\nIt updates partitions by finding all partitions in S3, and checking if a partition with matching values exists in the catalog.\nIf it finds a matching partition, it updates the existing partition with the new location.\n\n### `glutil delete-bad-tables`\n\nSometimes when running a Glue Crawler, the crawler doesn't aggregate the data correctly, and instead creates tables for individual partitions.\nWhen this happens, it may create a large number of junk tables in the catalog.\n`delete-bad-tables` should be run to fix this.\n\n`delete-bad-tables` deletes any tables in your Glue Data Catalog that meet the following criteria:\n\n- A table with a path that is below another table's path.\n\n For example, if you have two tables with these paths:\n\n ```\n s3://some-data-bucket/table-path/, and\n s3://some-data-bucket/table-path/another-table/\n ```\n\n The table at `s3://some-data-bucket/table-path/another-table/` will be deleted.\n\n- A table with the same location as another, with a 
name that's a superstring of the other's (this is from the Glue Crawler semantic of creating tables which would otherwise have the same name with the name {table}-somelongid).\n\n For example, if you have the tables `foo` and `foo-buzzer`, both with the same location, `foo-buzzer` will be deleted.\n\n## Running `create-partitions` as a Lambda\n\nJournera's biggest use for this library is as a Glue Crawler replacement for tables and datasets the Glue Crawlers have problems parsing.\nInformation on this lambda can be found in the [lambda](./lambda) directory.\n\n## Contributing to Glutil\nThis project was recently open-sourced. As such, please pardon any sharp edges, and let us know about them by [creating an issue](https://github.com/Journera/glutil/issues/new).\n\nAll contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.\n\nA detailed overview on how to contribute can be found in the [contributing guide](CONTRIBUTING.md).\n\n## License\n[BSD 3](LICENSE)\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Journera/glutil", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "glutil", "package_url": "https://pypi.org/project/glutil/", "platform": "", "project_url": "https://pypi.org/project/glutil/", "project_urls": { "Homepage": "https://github.com/Journera/glutil" }, "release_url": "https://pypi.org/project/glutil/1.0.0/", "requires_dist": [ "boto3 (>=1.9)" ], "requires_python": ">=3.6.0", "summary": "A collection of utilities for managing AWS Glue Data Catalog tables backed by S3", "version": "1.0.0", "yanked": false, "yanked_reason": null }, "last_serial": 6048713, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "f8c8a65a4c56178762b05a044ba471a0", "sha256": "06be9dbd2fe871b34a94467dd5b20cca994695b53241918b7720aaf9d35bbd47" }, 
"downloads": -1, "filename": "glutil-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "f8c8a65a4c56178762b05a044ba471a0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.0", "size": 15947, "upload_time": "2019-06-19T15:46:43", "upload_time_iso_8601": "2019-06-19T15:46:43.941284Z", "url": "https://files.pythonhosted.org/packages/79/f0/0568a9db81429943a5da16cece53b11c458ea276c67844875386a79c002c/glutil-0.1.2-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "057c13071b8f5c9f1f2ab498b3d02b53", "sha256": "05d424fd775e754b5565b7dc1a4fc9d26704c1366f106ade4ffa09d3fafbde15" }, "downloads": -1, "filename": "glutil-0.1.2.tar.gz", "has_sig": false, "md5_digest": "057c13071b8f5c9f1f2ab498b3d02b53", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 16310, "upload_time": "2019-06-19T15:46:45", "upload_time_iso_8601": "2019-06-19T15:46:45.560331Z", "url": "https://files.pythonhosted.org/packages/51/de/f2ac5e0ff869e1198cbe7b8e9ed4b763bdc2ae4179244006062fa7e6fdc4/glutil-0.1.2.tar.gz", "yanked": false, "yanked_reason": null } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "f24e80d0483587025928e2f679c89604", "sha256": "43efc3af692773636f1a54f3061893ddb9326eaa42d1d14de14c5f359f24062d" }, "downloads": -1, "filename": "glutil-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "f24e80d0483587025928e2f679c89604", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.0", "size": 16469, "upload_time": "2019-06-24T21:21:36", "upload_time_iso_8601": "2019-06-24T21:21:36.574512Z", "url": "https://files.pythonhosted.org/packages/93/91/833099373206b10ae42e82535dab74f2d86ae66814a8c50759e6d2a9971f/glutil-0.1.3-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "040679f53e9ed56484364c031bda5642", "sha256": "081f51ed6a6a43c4ae367dcc755aaf16f08bbba9bad4ea17d80a990a47926e2d" }, 
"downloads": -1, "filename": "glutil-0.1.3.tar.gz", "has_sig": false, "md5_digest": "040679f53e9ed56484364c031bda5642", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 17405, "upload_time": "2019-06-24T21:21:38", "upload_time_iso_8601": "2019-06-24T21:21:38.216909Z", "url": "https://files.pythonhosted.org/packages/75/e2/a3ebdb4b954bb53e58043f7bdb3bad08839de63c22e7ebc5333b8dfc7af5/glutil-0.1.3.tar.gz", "yanked": false, "yanked_reason": null } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "97b0969835f8124fb1a6791a7ec5928c", "sha256": "6146809514657850a5f7cdc364e28edc1a5479fd4aed4058157bc7d4bcd38f61" }, "downloads": -1, "filename": "glutil-0.2.0-py2-none-any.whl", "has_sig": false, "md5_digest": "97b0969835f8124fb1a6791a7ec5928c", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": ">=3.6.0", "size": 17199, "upload_time": "2019-07-17T19:53:24", "upload_time_iso_8601": "2019-07-17T19:53:24.645266Z", "url": "https://files.pythonhosted.org/packages/a5/2c/1d5b75e74a6153543f677da65c34be4a78a5ff5817576355848dce04f2b3/glutil-0.2.0-py2-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "23e51d23fe777c0f907172876eb706da", "sha256": "e51b3745a64f74dcdee3faf5045ab267614a12011b49013a872a187c4509c2b3" }, "downloads": -1, "filename": "glutil-0.2.0.tar.gz", "has_sig": false, "md5_digest": "23e51d23fe777c0f907172876eb706da", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 15133, "upload_time": "2019-07-17T19:53:26", "upload_time_iso_8601": "2019-07-17T19:53:26.172544Z", "url": "https://files.pythonhosted.org/packages/d4/9f/bba84eae304c91d9b0d0b18222f91a32ee13eea47da412c469c6378502d9/glutil-0.2.0.tar.gz", "yanked": false, "yanked_reason": null } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "c810b99f7b294a74043bfcd6fe084cd1", "sha256": "b2bb2bef4e3528b71156cb26ecccabda3a298f786e8000a67e07d693540c14c9" }, 
"downloads": -1, "filename": "glutil-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "c810b99f7b294a74043bfcd6fe084cd1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.0", "size": 17198, "upload_time": "2019-07-17T20:03:18", "upload_time_iso_8601": "2019-07-17T20:03:18.367514Z", "url": "https://files.pythonhosted.org/packages/08/07/2279771eb4552ba9e8ceb630f4e9f94674101e99a0ff2bdd2cef2a8a9194/glutil-0.2.1-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "1cf40701019fb01dc90b5631278fbd1a", "sha256": "09d5d4fc5f7c8eb62ad19cc59e691538f1c919a28369d26440666f9c548544f6" }, "downloads": -1, "filename": "glutil-0.2.1.tar.gz", "has_sig": false, "md5_digest": "1cf40701019fb01dc90b5631278fbd1a", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 18076, "upload_time": "2019-07-17T20:03:21", "upload_time_iso_8601": "2019-07-17T20:03:21.107941Z", "url": "https://files.pythonhosted.org/packages/7b/2d/16d5e257f639004a80ee4cebda90f85456b5aa94ac8551d2a5bc36c9dc1d/glutil-0.2.1.tar.gz", "yanked": false, "yanked_reason": null } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "1d8b1f1abd392cc1fa7754af08666508", "sha256": "532207768fe66da9dc5cdad4b82772cfdc908a7e3baf45c3f93030aff80eb49e" }, "downloads": -1, "filename": "glutil-1.0.0-py2-none-any.whl", "has_sig": false, "md5_digest": "1d8b1f1abd392cc1fa7754af08666508", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": ">=3.6.0", "size": 17776, "upload_time": "2019-10-29T18:22:28", "upload_time_iso_8601": "2019-10-29T18:22:28.915474Z", "url": "https://files.pythonhosted.org/packages/f9/dd/ad8f59a799720142319ecb1cd019082d8b2720cbd5d96f5415545239355e/glutil-1.0.0-py2-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "5d92c7c1742e478ceedfdee3a49edae5", "sha256": "57f7a6e8d65dbc69522e45eee076e3dc5ec03e0344a1597d9563a58fc3feca83" }, 
"downloads": -1, "filename": "glutil-1.0.0.tar.gz", "has_sig": false, "md5_digest": "5d92c7c1742e478ceedfdee3a49edae5", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 15723, "upload_time": "2019-10-29T18:22:30", "upload_time_iso_8601": "2019-10-29T18:22:30.468873Z", "url": "https://files.pythonhosted.org/packages/13/6b/61343f811c28e4c6c4f079917ac9a7810843b5b79df786abc69d16ebea8d/glutil-1.0.0.tar.gz", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1d8b1f1abd392cc1fa7754af08666508", "sha256": "532207768fe66da9dc5cdad4b82772cfdc908a7e3baf45c3f93030aff80eb49e" }, "downloads": -1, "filename": "glutil-1.0.0-py2-none-any.whl", "has_sig": false, "md5_digest": "1d8b1f1abd392cc1fa7754af08666508", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": ">=3.6.0", "size": 17776, "upload_time": "2019-10-29T18:22:28", "upload_time_iso_8601": "2019-10-29T18:22:28.915474Z", "url": "https://files.pythonhosted.org/packages/f9/dd/ad8f59a799720142319ecb1cd019082d8b2720cbd5d96f5415545239355e/glutil-1.0.0-py2-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "5d92c7c1742e478ceedfdee3a49edae5", "sha256": "57f7a6e8d65dbc69522e45eee076e3dc5ec03e0344a1597d9563a58fc3feca83" }, "downloads": -1, "filename": "glutil-1.0.0.tar.gz", "has_sig": false, "md5_digest": "5d92c7c1742e478ceedfdee3a49edae5", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 15723, "upload_time": "2019-10-29T18:22:30", "upload_time_iso_8601": "2019-10-29T18:22:30.468873Z", "url": "https://files.pythonhosted.org/packages/13/6b/61343f811c28e4c6c4f079917ac9a7810843b5b79df786abc69d16ebea8d/glutil-1.0.0.tar.gz", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }