{ "info": { "author": "Fredrick Galoso", "author_email": "fred+arbalest@dwolla.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Topic :: Database :: Database Engines/Servers", "Topic :: System :: Distributed Computing" ], "description": "arbalest\n========\n\n.. image:: https://travis-ci.org/Dwolla/arbalest.svg?branch=master\n :target: https://travis-ci.org/Dwolla/arbalest\n\n.. image:: https://readthedocs.org/projects/arbalest/badge/?version=latest\n :target: http://arbalest.readthedocs.org/en/latest/?badge=latest\n\n`Arbalest`_ is a Python data pipeline orchestration library for `Amazon S3`_\nand `Amazon Redshift`_.\nIt takes care of the heavy lifting of making data queryable at scale in AWS.\n\n.. _Arbalest: https://github.com/Dwolla/arbalest\n.. _Amazon S3: https://aws.amazon.com/documentation/s3\n.. 
_Amazon Redshift: https://aws.amazon.com/documentation/redshift\n\nIt takes care of:\n\n* Ingesting data into Amazon Redshift\n* Schema creation and validation\n* Creating highly available and scalable data import strategies\n* Generating and uploading prerequisite artifacts for import\n* Running data import jobs\n* Orchestrating idempotent and fault-tolerant multi-step ETL pipelines with SQL\n\n**Why Arbalest?**\n\n* A lightweight library, rather than a heavyweight framework, that can be composed with existing data tools\n* Python is a `de facto `_ `lingua `_ `franca `_ for data science\n* Configuration as code\n* Batteries included, for example, strategies for ingesting time series or sparse data (``arbalest.pipeline``), or integration with an existing pipeline topology (``arbalest.contrib``)\n\n**Use Cases**\n\nArbalest is not a MapReduce framework; rather, it is designed to make Amazon Redshift (and all its strengths) easy to use\nwith typical data workflows and tools. Here are a few examples:\n\n* You are already using a `MapReduce `_ `framework `_ to process data in S3. Arbalest could make the results of an `Elastic MapReduce `_ job queryable with SQL in Redshift. You can then hand off to Arbalest to define additional ETL in plain old SQL.\n* You treat S3 as a catch-all data sink, perhaps persisting JSON messages or events from a message system like `Kafka `_ or RabbitMQ. Arbalest can expose some or all of this data in a data warehouse using Redshift. The ecosystem of SQL is now available for dashboards, reports, and ad hoc analysis.\n* You have complex pipelines that could benefit from a fast, SQL queryable data sink. Arbalest has support out of the box (``arbalest.contrib``) to integrate with tools like `Luigi `_ to be part of a multi-dependency, multi-step pipeline topology.\n\nGetting Started\n---------------\n\nGetting started is easy with ``pip``:\n\n``pip install arbalest``\n\nExample Arbalest pipelines are in ``examples/``. 
An overview of concepts and classes is below.\n\n`Note`\n\nArbalest depends on psycopg2. However, installing psycopg2 on Windows may not be straightforward.\n\nTo install psycopg2 on Windows:\n\n64 bit Python installation:\n\n``pip install -e git+https://github.com/nwcell/psycopg2-windows.git@win64-py27#egg=psycopg2``\n\n32 bit Python installation:\n\n``pip install -e git+https://github.com/nwcell/psycopg2-windows.git@win32-py27#egg=psycopg2``\n\nPipelines\n---------\n\nArbalest orchestrates data loading using pipelines. Each ``Pipeline``\ncan have one or many steps that are made up of three parts:\n\n``metadata``: Path in an S3 bucket to store information needed for the copy process.\n\n``source``: Path in an S3 bucket containing the data to copy, stored as JSON object files:\n\n``{ \"id\": \"66bc8153-d6d9-4351-bada-803330f22db7\", \"someNumber\": 1 }``\n\n``schema``: Definition of JSON objects to map into Redshift rows.\n\nSchemas\n-------\n\nA schema is defined using a `JsonObject` mapper which consists of one or many `Property` declarations.\nBy default, the name of the JSON property is used as the column name, but a\ncustom column name can be set instead. Column names have a\n`maximum length of 127 characters `_. Column names\nlonger than 127 characters will be truncated.\nNested properties will create a default column name delimited by an underscore.\n\nExample JSON Object (whitespace for clarity)::\n\n {\n \"id\": \"66bc8153-d6d9-4351-bada-803330f22db7\",\n \"someNumber\": 1,\n \"child\" : {\n \"someBoolean\": true\n }\n }\n\nExample Schema:\n\n.. 
code-block:: python\n\n JsonObject('destination_table_name',\n Property('id', 'VARCHAR(36)'),\n Property('someNumber', 'INTEGER', 'custom_column_name'),\n Property('child', Property('someBoolean', 'BOOLEAN')))\n\nCopy Strategies\n---------------\n\nThe ``S3CopyPipeline`` supports different strategies for copying data from S3 to Redshift.\n\n**Bulk copy**\n\nBulk copy imports all keys in an S3 path into a Redshift table using a staging table.\nBy dropping and reimporting all data, duplication is eliminated.\nThis type of copy is useful for data that does not change very often or will\nonly be ingested once (e.g. immutable time series).\n\n**Manifest copy**\n\nA manifest copy imports all keys in an S3 path into a Redshift table using a `manifest `_.\nIn addition, a journal of successfully imported objects is persisted to the ``metadata`` path.\nSubsequent runs of this copy step will only copy S3 keys that do not exist in the journal.\nThis type of copy is useful for data in a path that changes often.\n\nExample data copies:\n\n.. 
code-block:: python\n\n #!/usr/bin/env python\n import psycopg2\n from arbalest.configuration import env\n from arbalest.redshift import S3CopyPipeline\n from arbalest.redshift.schema import JsonObject, Property\n\n if __name__ == '__main__':\n pipeline = S3CopyPipeline(\n aws_access_key_id=env('AWS_ACCESS_KEY_ID'),\n aws_secret_access_key=env('AWS_SECRET_ACCESS_KEY'),\n bucket=env('BUCKET_NAME'),\n db_connection=psycopg2.connect(env('REDSHIFT_CONNECTION')))\n\n pipeline.bulk_copy(metadata='path_to_save_pipeline_metadata',\n source='path_of_source_data',\n schema=JsonObject('destination_table_name',\n Property('id', 'VARCHAR(36)'),\n Property('someNumber', 'INTEGER',\n 'custom_column_name')))\n\n pipeline.manifest_copy(metadata='path_to_save_pipeline_metadata',\n source='path_of_incremental_source_data',\n schema=JsonObject('incremental_destination_table_name',\n Property('id', 'VARCHAR(36)'),\n Property('someNumber', 'INTEGER',\n 'custom_column_name')))\n\n pipeline.run()\n\nSQL\n---\n\nPipelines can also have arbitrary SQL steps.\nEach SQL step can have one or many statements which are executed in a transaction, for example, orchestrating additional ETL (extract, transform, and load).\nExpanding on the previous example:\n\n.. 
code-block:: python\n\n #!/usr/bin/env python\n import psycopg2\n from arbalest.configuration import env\n from arbalest.redshift import S3CopyPipeline\n from arbalest.redshift.schema import JsonObject, Property\n\n if __name__ == '__main__':\n pipeline = S3CopyPipeline(\n aws_access_key_id=env('AWS_ACCESS_KEY_ID'),\n aws_secret_access_key=env('AWS_SECRET_ACCESS_KEY'),\n bucket=env('BUCKET_NAME'),\n db_connection=psycopg2.connect(env('REDSHIFT_CONNECTION')))\n\n pipeline.bulk_copy(metadata='path_to_save_pipeline_metadata',\n source='path_of_source_data',\n schema=JsonObject('destination_table_name',\n Property('id', 'VARCHAR(36)'),\n Property('someNumber', 'INTEGER',\n 'custom_column_name')))\n\n pipeline.manifest_copy(metadata='path_to_save_pipeline_metadata',\n source='path_of_incremental_source_data',\n schema=JsonObject('incremental_destination_table_name',\n Property('id', 'VARCHAR(36)'),\n Property('someNumber', 'INTEGER',\n 'custom_column_name')))\n\n pipeline.sql(('SELECT someNumber + %s '\n 'INTO some_olap_table FROM destination_table_name', 1),\n ('SELECT * INTO destination_table_name_copy '\n 'FROM destination_table_name'))\n\n pipeline.run()\n\nOrchestration Helpers\n---------------------\n\nIncluded in this project are a variety of orchestration helpers to assist with\nthe creation of pipelines.\nThese classes are defined in the ``arbalest.pipeline`` and ``arbalest.contrib`` modules.\n\nSorted Data Sources\n~~~~~~~~~~~~~~~~~~~\n\nAssuming source data is stored in a sortable series of directories, ``S3SortedDataSources``\nfacilitates the retrieval of S3 paths in a sequence for import, given a start\nand/or end. 
In addition, it has methods to mark a cursor in an S3 persisted journal.\n\n**Examples of data stored as a sorted series**\n\nSequential integers::\n\n s3://bucket/child/1/*\n s3://bucket/child/2/*\n s3://bucket/child/3/*\n\nTime series::\n\n s3://bucket/child/2015-01-01/*\n s3://bucket/child/2015-01-02/*\n s3://bucket/child/2015-01-03/*\n s3://bucket/child/2015-01-04/00/*\n\n**Example of sorted data source class**\n\n.. code-block:: python\n\n S3SortedDataSources(\n metadata='',\n source='child',\n bucket=bucket,\n start=env('START'),\n end=env('END'))\n\nTime Series\n~~~~~~~~~~~\n\n``SqlTimeSeriesImport`` implements a bulk copy and update strategy of data from\na list of time series sources from ``S3SortedDataSources`` into an existing\ntarget table.\n\n**Example time series import from an S3 time series topology, ingesting a day of objects**\n\nTime series path topology::\n\n s3://bucket/child/2015-01-01/*\n s3://bucket/child/2015-01-02/*\n\n.. code-block:: python\n\n class ExamplePipeline(S3CopyPipeline):\n def __init__(self,\n aws_access_key_id,\n aws_secret_access_key,\n bucket,\n db_connection):\n super(ExamplePipeline, self).__init__(\n aws_access_key_id,\n aws_secret_access_key,\n bucket,\n db_connection)\n\n # Create table to ingest data into if it does not exist\n self.sql('CREATE TABLE IF NOT EXISTS target_table(id VARCHAR(36), someNumber INTEGER, timestamp TIMESTAMP);')\n\n # Arguments are passed positionally because Python does not allow\n # positional arguments (the properties) after keyword arguments\n time_series = SqlTimeSeriesImport(\n 'target_table', # destination_table\n '2015-01-01', # update_date: replace existing events, if any, after this timestamp\n S3SortedDataSources( # sources\n metadata='',\n source='child',\n bucket=bucket,\n start='2015-01-01',\n end='2015-01-02'),\n Property('id', 'VARCHAR(36)'),\n Property('someNumber', 'INTEGER'),\n Property('timestamp', 'TIMESTAMP'))\n\n # Populate target_table using a bulk copy per day\n time_series.bulk_copy(\n pipeline=self,\n metadata='',\n max_error=1000, # Maximum errors tolerated by Redshift COPY\n 
order_by_column='timestamp') # Use column named timestamp to sort by and replace existing events, if any\n\nLuigi\n~~~~~\n\n``PipelineTask`` wraps any ``arbalest.core.Pipeline`` into a `Luigi Task `_.\nThis allows for the composition of workflows with dependency graphs, for example,\ndata pipelines that are dependent on multiple steps or other pipelines. Luigi then takes care of\nthe heavy lifting of\n`scheduling and executing `_\nmultistep pipelines.\n\nLicense\n-------\n\nArbalest is licensed under the `MIT License `_.\n\nAuthors and Contributors\n------------------------\n\nArbalest was built at Dwolla, primarily by `Fredrick Galoso `_.\nInitial support for Luigi and contributions to orchestration helpers by `Hayden Goldstien `_.\nWe gladly welcome contributions and feedback. If you are using Arbalest we would love to know.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Dwolla/arbalest", "keywords": null, "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "arbalest", "package_url": "https://pypi.org/project/arbalest/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/arbalest/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/Dwolla/arbalest" }, "release_url": "https://pypi.org/project/arbalest/1.6.2/", "requires_dist": null, "requires_python": null, "summary": "Arbalest orchestrates bulk data loading for Amazon Redshift", "version": "1.6.2" }, "last_serial": 1875540, "releases": { "1.6.0": [ { "comment_text": "", "digests": { "md5": "b09bce185589b06ef2057d3168537126", "sha256": "f8c7e975789991ad63f0342e03839c5034b1cae7f86534340eab91a2c4d75639" }, "downloads": -1, "filename": "arbalest-1.6.0.tar.gz", "has_sig": false, "md5_digest": "b09bce185589b06ef2057d3168537126", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10822, 
"upload_time": "2015-10-21T19:26:22", "url": "https://files.pythonhosted.org/packages/a7/52/cdeb8858d61c54dc329a9e591360eac0e5ea1e6904a0e98128a6fd8c61ae/arbalest-1.6.0.tar.gz" } ], "1.6.1": [ { "comment_text": "", "digests": { "md5": "c59baf150d44e965fbaa0d8e0e1814ee", "sha256": "af20aaab6c00e92bfc85fe056cd999c9bd5aa4ea3834cd54b9eba372a9c96867" }, "downloads": -1, "filename": "arbalest-1.6.1.tar.gz", "has_sig": false, "md5_digest": "c59baf150d44e965fbaa0d8e0e1814ee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12207, "upload_time": "2015-10-21T19:33:21", "url": "https://files.pythonhosted.org/packages/51/50/80ae00b92eabe310bcc22e5c53c5e78811e48d9e338966ff225b90fee521/arbalest-1.6.1.tar.gz" } ], "1.6.2": [ { "comment_text": "", "digests": { "md5": "07aaf046a7e94079a8da2488ae02a4ef", "sha256": "9c69f96750e9485573cf1aeecd866030559030289513f5e3cad880c32cdb217b" }, "downloads": -1, "filename": "arbalest-1.6.2.tar.gz", "has_sig": false, "md5_digest": "07aaf046a7e94079a8da2488ae02a4ef", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12969, "upload_time": "2015-10-27T13:50:59", "url": "https://files.pythonhosted.org/packages/47/0d/0f503f12b57e7a6722680ded3b358fc5fa56a33e462cf683cb1798035dbd/arbalest-1.6.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "07aaf046a7e94079a8da2488ae02a4ef", "sha256": "9c69f96750e9485573cf1aeecd866030559030289513f5e3cad880c32cdb217b" }, "downloads": -1, "filename": "arbalest-1.6.2.tar.gz", "has_sig": false, "md5_digest": "07aaf046a7e94079a8da2488ae02a4ef", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12969, "upload_time": "2015-10-27T13:50:59", "url": "https://files.pythonhosted.org/packages/47/0d/0f503f12b57e7a6722680ded3b358fc5fa56a33e462cf683cb1798035dbd/arbalest-1.6.2.tar.gz" } ] }