{ "info": { "author": "Mara contributors", "author_email": "", "bugtrack_url": null, "classifiers": [], "description": "# Mara Data Integration\n\n[![Build Status](https://travis-ci.org/mara/data-integration.svg?branch=master)](https://travis-ci.org/mara/data-integration)\n[![PyPI - License](https://img.shields.io/pypi/l/data-integration.svg)](https://github.com/mara/data-integration/blob/master/LICENSE)\n[![PyPI version](https://badge.fury.io/py/data-integration.svg)](https://badge.fury.io/py/data-integration)\n[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://communityinviter.com/apps/mara-users/public-invite)\n\n\nThis package contains a lightweight ETL framework with a focus on transparency and complexity reduction. It has a number of baked-in assumptions/ principles:\n\n- Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code.\n\n- PostgreSQL as a data processing engine.\n\n- Extensive web ui. The web browser as the main tool for inspecting, running and debugging pipelines.\n\n- GNU make semantics. Nodes depend on the completion of upstream nodes. No data dependencies or data flows.\n\n- No in-app data processing: command line tools as the main tool for interacting with databases and data.\n\n- Single machine pipeline execution based on Python's [multiprocessing](https://docs.python.org/3.6/library/multiprocessing.html). No need for distributed task queues. 
Easy debugging and output logging.\n\n- Cost-based priority queues: nodes with higher cost (based on recorded run times) are run first.\n\n \n\n## Installation\n\nTo use the library directly, use pip:\n\n```\npip install data-integration\n```\n\nor\n \n```\npip install git+https://github.com/mara/data-integration.git\n```\n\nFor an example of an integration into a Flask application, have a look at the [mara example project](https://github.com/mara/mara-example-project).\n\n\n \n\n## Example\n\nHere is a pipeline \"demo\" consisting of three nodes that depend on each other: the task `ping_localhost`, the pipeline `sub_pipeline` and the task `sleep`:\n\n```python\nfrom data_integration.commands.bash import RunBash\nfrom data_integration.pipelines import Pipeline, Task\nfrom data_integration.ui.cli import run_pipeline, run_interactively\n\npipeline = Pipeline(\n id='demo',\n description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')\n\npipeline.add(Task(id='ping_localhost', description='Pings localhost',\n commands=[RunBash('ping -c 3 localhost')]))\n\nsub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')\n\nfor host in ['google', 'amazon', 'facebook']:\n sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',\n commands=[RunBash(f'ping -c 3 {host}.com')]))\n\nsub_pipeline.add_dependency('ping_amazon', 'ping_facebook')\nsub_pipeline.add(Task(id='ping_foo', description='Pings foo',\n commands=[RunBash('ping foo')]), ['ping_amazon'])\n\npipeline.add(sub_pipeline, ['ping_localhost'])\n\npipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',\n commands=[RunBash('sleep 2')]), ['sub_pipeline'])\n```\n\nTasks contain lists of commands, which do the actual work (in this case running bash commands that ping various hosts). 
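\n\nThe cost-based priority queue principle listed above can be sketched conceptually as follows. This is a minimal illustration of the idea only, not the package's actual scheduler; the node names and run times are invented:\n\n
```python
import heapq

# Hypothetical recorded run times (in seconds) per node -- illustrative only.
recorded_run_times = {'load_orders': 120.0, 'transform_orders': 45.0, 'ping_localhost': 0.5}

# heapq is a min-heap, so negate the cost to dequeue the most expensive node first.
queue = [(-cost, node) for node, cost in recorded_run_times.items()]
heapq.heapify(queue)

execution_order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(execution_order)  # ['load_orders', 'transform_orders', 'ping_localhost']
```
\nIn the framework itself, run times are recorded in the mara database during pipeline runs (see the setup section below) and runnable nodes are prioritized accordingly.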
\n\n \n\nIn order to run the pipeline, a PostgreSQL database needs to be configured for storing run-time information, run output and status of incremental processing: \n\n```python\nimport mara_db.auto_migration\nimport mara_db.config\nimport mara_db.dbs\n\nmara_db.config.databases \\\n = lambda: {'mara': mara_db.dbs.PostgreSQLDB(host='localhost', user='root', database='example_etl_mara')}\n\nmara_db.auto_migration.auto_discover_models_and_migrate()\n```\n\nGiven that PostgreSQL is running and the credentials work, the output looks like this (a database with a number of tables is created):\n\n```\nCreated database \"postgresql+psycopg2://root@localhost/example_etl_mara\"\n\nCREATE TABLE data_integration_file_dependency (\n node_path TEXT[] NOT NULL, \n dependency_type VARCHAR NOT NULL, \n hash VARCHAR, \n timestamp TIMESTAMP WITHOUT TIME ZONE, \n PRIMARY KEY (node_path, dependency_type)\n);\n\n.. more tables\n```\n\n### CLI UI\n\nThis runs a pipeline with output to stdout:\n\n```python\nfrom data_integration.ui.cli import run_pipeline\n\nrun_pipeline(pipeline)\n```\n\n![Example run cli 1](https://github.com/mara/data-integration/raw/master/docs/example-run-cli-1.gif)\n\n \n\nAnd this runs a single node of pipeline `sub_pipeline` together with all the nodes that it depends on:\n\n```python\nrun_pipeline(sub_pipeline, nodes=[sub_pipeline.nodes['ping_amazon']], with_upstreams=True)\n```\n\n![Example run cli 2](https://github.com/mara/data-integration/raw/master/docs/example-run-cli-2.gif)\n\n \n\n\nAnd finally, there is a menu based on [pythondialog](http://pythondialog.sourceforge.net/) that allows navigating and running pipelines like this:\n\n```python\nfrom data_integration.ui.cli import run_interactively\n\nrun_interactively()\n```\n\n![Example run cli 3](https://github.com/mara/data-integration/raw/master/docs/example-run-cli-3.gif)\n\n\n\n### Web UI\n\nMore importantly, this package provides an extensive web interface. 
It can be easily integrated into any [Flask](http://flask.pocoo.org/)-based app, and the [mara example project](https://github.com/mara/mara-example-project) demonstrates how to do this using [mara-app](https://github.com/mara/mara-app).\n\nFor each pipeline, there is a page that shows\n\n- a graph of all child nodes and the dependencies between them\n- a chart of the overall run time of the pipeline and its most expensive nodes over the last 30 days (configurable)\n- a table of all the pipeline's nodes with their average run times and the resulting queuing priority\n- output and timeline for the last runs of the pipeline\n\n\n![Mara data integration web ui 1](https://github.com/mara/data-integration/raw/master/docs/mara-data-integration-web-ui-1.png)\n\nFor each task, there is a page showing \n\n- the upstreams and downstreams of the task in the pipeline\n- the run times of the task in the last 30 days\n- all commands of the task\n- output of the last runs of the task\n\n![Mara data integration web ui 2](https://github.com/mara/data-integration/raw/master/docs/mara-data-integration-web-ui-2.png)\n\n\nPipelines and tasks can be run from the web UI directly, which is probably one of the main features of this package: \n\n![Example run web ui](https://github.com/mara/data-integration/raw/master/docs/example-run-web-ui.gif)\n\n \n\n# Getting started\n\nDocumentation is currently work in progress. 
Please use the [mara example project](https://github.com/mara/mara-example-project) as a reference for getting started.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/mara/data-integration", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "data-integration", "package_url": "https://pypi.org/project/data-integration/", "platform": "", "project_url": "https://pypi.org/project/data-integration/", "project_urls": { "Homepage": "https://github.com/mara/data-integration" }, "release_url": "https://pypi.org/project/data-integration/2.5.0/", "requires_dist": null, "requires_python": ">=3.6", "summary": "Opinionated lightweight ETL pipeline framework", "version": "2.5.0" }, "last_serial": 5496517, "releases": { "2.4.1": [ { "comment_text": "", "digests": { "md5": "48e7b7137652d9b1130bf171ea322dbe", "sha256": "ba6696290b4218e47de683973be21ee3be293957ff76f90a3a66250d253c5994" }, "downloads": -1, "filename": "data-integration-2.4.1.tar.gz", "has_sig": false, "md5_digest": "48e7b7137652d9b1130bf171ea322dbe", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40866, "upload_time": "2019-07-05T07:53:38", "url": "https://files.pythonhosted.org/packages/e4/f6/9e4cb99a54b7b61ae8bd9aba788a1e4889aa4305f59d9dd1eecba352d418/data-integration-2.4.1.tar.gz" } ], "2.4.2": [ { "comment_text": "", "digests": { "md5": "ac64dfeeb5c6e436ccf0386249b21b07", "sha256": "774742100de37902813f36315e3e380f8cb5bf240017fac1b3794a45b7286b72" }, "downloads": -1, "filename": "data-integration-2.4.2.tar.gz", "has_sig": false, "md5_digest": "ac64dfeeb5c6e436ccf0386249b21b07", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40897, "upload_time": "2019-07-05T08:20:27", "url": 
"https://files.pythonhosted.org/packages/c5/54/886c0e22dd0ac6947f80b8f5909033b67803ac14029dd3c59c6c59b3a84b/data-integration-2.4.2.tar.gz" } ], "2.5.0": [ { "comment_text": "", "digests": { "md5": "c6f64971b66afc3010c360144b9c562f", "sha256": "b027245266fcaca9c4813c84ad6d99a79bc41e154f3d8682d732965fd7cb4703" }, "downloads": -1, "filename": "data-integration-2.5.0.tar.gz", "has_sig": false, "md5_digest": "c6f64971b66afc3010c360144b9c562f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40909, "upload_time": "2019-07-07T08:21:00", "url": "https://files.pythonhosted.org/packages/1a/36/24d00c9889bab2dc26850a2bba710bd2065512a6d79965ae49396792ce6a/data-integration-2.5.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c6f64971b66afc3010c360144b9c562f", "sha256": "b027245266fcaca9c4813c84ad6d99a79bc41e154f3d8682d732965fd7cb4703" }, "downloads": -1, "filename": "data-integration-2.5.0.tar.gz", "has_sig": false, "md5_digest": "c6f64971b66afc3010c360144b9c562f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 40909, "upload_time": "2019-07-07T08:21:00", "url": "https://files.pythonhosted.org/packages/1a/36/24d00c9889bab2dc26850a2bba710bd2065512a6d79965ae49396792ce6a/data-integration-2.5.0.tar.gz" } ] }