{ "info": { "author": "Mara contributors", "author_email": "", "bugtrack_url": null, "classifiers": [], "description": "# Mara Data Integration\n\n[![Build Status](https://travis-ci.org/mara/data-integration.svg?branch=master)](https://travis-ci.org/mara/data-integration)\n[![PyPI - License](https://img.shields.io/pypi/l/data-integration.svg)](https://github.com/mara/data-integration/blob/master/LICENSE)\n[![PyPI version](https://badge.fury.io/py/data-integration.svg)](https://badge.fury.io/py/data-integration)\n[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://communityinviter.com/apps/mara-users/public-invite)\n\n\nThis package contains a lightweight ETL framework with a focus on transparency and complexity reduction. It has a number of baked-in assumptions/ principles:\n\n- Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code.\n\n- PostgreSQL as a data processing engine.\n\n- Extensive web ui. The web browser as the main tool for inspecting, running and debugging pipelines.\n\n- GNU make semantics. Nodes depend on the completion of upstream nodes. No data dependencies or data flows.\n\n- No in-app data processing: command line tools as the main tool for interacting with databases and data.\n\n- Single machine pipeline execution based on Python's [multiprocessing](https://docs.python.org/3.6/library/multiprocessing.html). No need for distributed task queues. 
Easy debugging and output logging.\n\n- Cost-based priority queues: nodes with higher cost (based on recorded run times) are run first.\n\n \n\n## Installation\n\nTo use the library directly, use pip:\n\n```\npip install data-integration\n```\n\nor\n \n```\npip install git+https://github.com/mara/data-integration.git\n```\n\nFor an example of an integration into a Flask application, have a look at the [mara example project](https://github.com/mara/mara-example-project).\n\n\n \n\n## Example\n\nHere is a pipeline \"demo\" consisting of three nodes that depend on each other: the task `ping_localhost`, the pipeline `sub_pipeline` and the task `sleep`:\n\n```python\nfrom data_integration.commands.bash import RunBash\nfrom data_integration.pipelines import Pipeline, Task\nfrom data_integration.ui.cli import run_pipeline, run_interactively\n\npipeline = Pipeline(\n id='demo',\n description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')\n\npipeline.add(Task(id='ping_localhost', description='Pings localhost',\n commands=[RunBash('ping -c 3 localhost')]))\n\nsub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')\n\nfor host in ['google', 'amazon', 'facebook']:\n sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',\n commands=[RunBash(f'ping -c 3 {host}.com')]))\n\nsub_pipeline.add_dependency('ping_amazon', 'ping_facebook')\nsub_pipeline.add(Task(id='ping_foo', description='Pings foo',\n commands=[RunBash('ping foo')]), ['ping_amazon'])\n\npipeline.add(sub_pipeline, ['ping_localhost'])\n\npipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',\n commands=[RunBash('sleep 2')]), ['sub_pipeline'])\n```\n\nTasks contain lists of commands, which do the actual work (in this case running bash commands that ping various hosts). 
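\n\nThe cost-based priority queue principle listed above can be sketched conceptually as follows. This is a minimal illustration of the idea only, not the package's actual scheduler; the node names and run times are invented:\n\n
```python
import heapq

# Hypothetical recorded run times (in seconds) per node -- illustrative only.
recorded_run_times = {'load_orders': 120.0, 'transform_orders': 45.0, 'ping_localhost': 0.5}

# heapq is a min-heap, so negate the cost to dequeue the most expensive node first.
queue = [(-cost, node) for node, cost in recorded_run_times.items()]
heapq.heapify(queue)

execution_order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(execution_order)  # ['load_orders', 'transform_orders', 'ping_localhost']
```
\nIn the framework itself, run times are recorded in the mara database during pipeline runs (see the setup section below) and runnable nodes are prioritized accordingly.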
\n\n \n\nIn order to run the pipeline, a PostgreSQL database needs to be configured for storing run-time information, run output and status of incremental processing: \n\n```python\nimport mara_db.auto_migration\nimport mara_db.config\nimport mara_db.dbs\n\nmara_db.config.databases \\\n = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='root', database='example_etl_mara')}\n\nmara_db.auto_migration.auto_discover_models_and_migrate()\n```\n\nGiven that PostgreSQL is running and the credentials work, the output looks like this (a database with a number of tables is created):\n\n```\nCreated database \"postgresql+psycopg2://root@localhost/example_etl_mara\"\n\nCREATE TABLE data_integration_file_dependency (\n node_path TEXT[] NOT NULL, \n dependency_type VARCHAR NOT NULL, \n hash VARCHAR, \n timestamp TIMESTAMP WITHOUT TIME ZONE, \n PRIMARY KEY (node_path, dependency_type)\n);\n\n.. more tables\n```\n\n### CLI UI\n\nThis runs a pipeline with output to stdout:\n\n```python\nfrom data_integration.ui.cli import run_pipeline\n\nrun_pipeline(pipeline)\n```\n\n![Example run cli 1](https://github.com/mara/data-integration/raw/master/docs/example-run-cli-1.gif)\n\n \n\nAnd this runs a single node of pipeline `sub_pipeline` together with all the nodes that it depends on:\n\n```python\nrun_pipeline(sub_pipeline, nodes=[sub_pipeline.nodes['ping_amazon']], with_upstreams=True)\n```\n\n![Example run cli 2](https://github.com/mara/data-integration/raw/master/docs/example-run-cli-2.gif)\n\n \n\n\nAnd finally, there is a menu based on [pythondialog](http://pythondialog.sourceforge.net/) that allows navigating and running pipelines like this:\n\n```python\nfrom data_integration.ui.cli import run_interactively\n\nrun_interactively()\n```\n\n![Example run cli 3](https://github.com/mara/data-integration/raw/master/docs/example-run-cli-3.gif)\n\n\n\n### Web UI\n\nMore importantly, this package provides an extensive web interface. 
It can be easily integrated into any [Flask](http://flask.pocoo.org/)-based app, and the [mara example project](https://github.com/mara/mara-example-project) demonstrates how to do this using [mara-app](https://github.com/mara/mara-app).\n\nFor each pipeline, there is a page that shows\n\n- a graph of all child nodes and the dependencies between them\n- a chart of the overall run time of the pipeline and its most expensive nodes over the last 30 days (configurable)\n- a table of all the pipeline's nodes with their average run times and the resulting queuing priority\n- output and timeline for the last runs of the pipeline\n\n\n![Mara data integration web ui 1](https://github.com/mara/data-integration/raw/master/docs/mara-data-integration-web-ui-1.png)\n\nFor each task, there is a page showing \n\n- the upstreams and downstreams of the task in the pipeline\n- the run times of the task in the last 30 days\n- all commands of the task\n- output of the last runs of the task\n\n![Mara data integration web ui 2](https://github.com/mara/data-integration/raw/master/docs/mara-data-integration-web-ui-2.png)\n\n\nPipelines and tasks can be run from the web UI directly, which is probably one of the main features of this package: \n\n![Example run web ui](https://github.com/mara/data-integration/raw/master/docs/example-run-web-ui.gif)\n\n \n\n# Getting started\n\nDocumentation is currently work in progress. 
Please use the [mara example project](https://github.com/mara/mara-example-project) as a reference for getting started.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/mara/data-integration", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "data-integration", "package_url": "https://pypi.org/project/data-integration/", "platform": "", "project_url": "https://pypi.org/project/data-integration/", "project_urls": { "Homepage": "https://github.com/mara/data-integration" }, "release_url": "https://pypi.org/project/data-integration/2.5.0/", "requires_dist": null, "requires_python": ">=3.6", "summary": "Opinionated lightweight ETL pipeline framework", "version": "2.5.0" }, "last_serial": 5496517, "releases": { "2.4.1": [ { "comment_text": "", "digests": { "md5": "48e7b7137652d9b1130bf171ea322dbe", "sha256": "ba6696290b4218e47de683973be21ee3be293957ff76f90a3a66250d253c5994" }, "downloads": -1, "filename": "data-integration-2.4.1.tar.gz", "has_sig": false, "md5_digest": "48e7b7137652d9b1130bf171ea322dbe", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40866, "upload_time": "2019-07-05T07:53:38", "url": "https://files.pythonhosted.org/packages/e4/f6/9e4cb99a54b7b61ae8bd9aba788a1e4889aa4305f59d9dd1eecba352d418/data-integration-2.4.1.tar.gz" } ], "2.4.2": [ { "comment_text": "", "digests": { "md5": "ac64dfeeb5c6e436ccf0386249b21b07", "sha256": "774742100de37902813f36315e3e380f8cb5bf240017fac1b3794a45b7286b72" }, "downloads": -1, "filename": "data-integration-2.4.2.tar.gz", "has_sig": false, "md5_digest": "ac64dfeeb5c6e436ccf0386249b21b07", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40897, "upload_time": "2019-07-05T08:20:27", "url": 
"https://files.pythonhosted.org/packages/c5/54/886c0e22dd0ac6947f80b8f5909033b67803ac14029dd3c59c6c59b3a84b/data-integration-2.4.2.tar.gz" } ], "2.5.0": [ { "comment_text": "", "digests": { "md5": "c6f64971b66afc3010c360144b9c562f", "sha256": "b027245266fcaca9c4813c84ad6d99a79bc41e154f3d8682d732965fd7cb4703" }, "downloads": -1, "filename": "data-integration-2.5.0.tar.gz", "has_sig": false, "md5_digest": "c6f64971b66afc3010c360144b9c562f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40909, "upload_time": "2019-07-07T08:21:00", "url": "https://files.pythonhosted.org/packages/1a/36/24d00c9889bab2dc26850a2bba710bd2065512a6d79965ae49396792ce6a/data-integration-2.5.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c6f64971b66afc3010c360144b9c562f", "sha256": "b027245266fcaca9c4813c84ad6d99a79bc41e154f3d8682d732965fd7cb4703" }, "downloads": -1, "filename": "data-integration-2.5.0.tar.gz", "has_sig": false, "md5_digest": "c6f64971b66afc3010c360144b9c562f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40909, "upload_time": "2019-07-07T08:21:00", "url": "https://files.pythonhosted.org/packages/1a/36/24d00c9889bab2dc26850a2bba710bd2065512a6d79965ae49396792ce6a/data-integration-2.5.0.tar.gz" } ] }