{ "info": { "author": "Hj\u00f6rtur Hjartarson", "author_email": "hjorturh88@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: System Administrators", "License :: OSI Approved :: MIT License", "Programming Language :: Python", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.1", "Programming Language :: Python :: 3.2", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "# DBloy\n\nA Databricks deployment CLI tool to enable Continuous Delivery of PySpark Notebooks based jobs.\n\n\n## Installation\n\n````bash\n$ pip install dbloy\n````\n\n## Usage\n\nAuthenticate with Databricks using authentication token:\n\n```bash\n$ dbloy configure \n```\n\nUpdate Databricks Job \n\n```bash\n$ dbloy apply --deploy-yml deploy.yml --configmap-yml configmap.yml --version \n```\n\nwhere `deploy.yml` and `configmap.yml` contain the Job specification. The Job version is specified in ``\n\n\n## Workflow\n\n![databricks workflow](https://databricks.com/wp-content/uploads/2017/10/CI-CD-BLOG4@2x-1024x211.png \"databricks workflow\")\nsource: https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html \n \n![example workflow](https://github.com/hjh17/dbloy/blob/master/uml.png?raw=true \"example workflow\")\n \n\n\n \n## Example Usage\n\nSee [example/gitlab_my-etl-job](https://github.com/hjh17/dbloy/tree/master/example/gitlab_my-etl-job) for a example ETL repository using Gitlab's CI/CD.\n\n\nA Deployment requires the following:\n\n* Deployment manifest\n* Configuration manifest\n* A main Databricks Notebook source file available locally. \n* (Optional) Attached python library containing the core logic. This allows easier unit testing of \n\n\n### Creating a Deployment\n\n\n\ndeploy.yml\n\n````yaml\nkind: Deployment\nmetadata:\n name: my-etl-job\n workspace: Shared\ntemplate:\n job:\n name: My ETL Job\n notifications:\n email:\n no_alert_for_skipped_runs: true\n on_failure :\n - my_email@my_org.com\n base_notebook: main\n notebooks:\n - EPHEMERAL_NOTEBOOK_1: notebook_name1\n - EPHEMERAL_NOTEBOOK_2: notebook_name2\n libraries:\n - egg_main: dbfs:/python35/my_python_lib/my_python_lib-VERSION-py3.5.egg\n - egg: dbfs:/python35/static_python_lib.egg\n - pypi:\n package: scikit-learn==0.20.3\n - pypi:\n package: statsmodels==0.10.1\n - pypi:\n package: prometheus-client==0.7.1\n - jar: dbfs:/FileStore/jars/e9b87e4c_c754_4707_a62a_44ef47535b39-azure_cosmosdb_spark_2_4_0_2_11_1_3_4_uber-38021.jar\n run:\n max_concurrent_runs: 1\n max_retries: 1\n min_retry_interval_millis: 600000\n retry_on_timeout: true\n timeout_seconds: 10800\n````\n\nconfigmap.yml\n\n````yaml\nkind: ConfigMap\nmetadata:\n namespace: production\nparams:\n DB_URL: production_db_url_1\n DB_PASSWORD: production_password123\njob:\n id: 289\n schedule:\n quartz_cron_expression: \"0 0 0 * * ?\"\n timezone_id: \"Europe/Berlin\"\n max_retries: \"1\"\ncluster:\n spark_version: \"5.3.x-scala2.11\"\n node_type_id: \"Standard_DS3_v2\"\n driver_node_type_id: \"Standard_DS3_v2\"\n autoscale:\n min_workers: 1\n max_workers: 2\n spark_env_vars:\n PYSPARK_PYTHON: \"/databricks/python3/bin/python3\"\n\n````\n\nIn this example:\n\n* Job id `289` on Databricks, indicated by the `.job.id` field in `configmap.yml`, will be updated with the name `My ETL Job`, indicated by the `.template.job.name` field in `deploy.yml`.\n* A cluster will be created on demand which is specified by the field `.cluster` in `configmap.yml`. See https://docs.databricks.com/api/latest/clusters.html#request-structure for a complete list of cluster settings. **Note**: Setting `.cluster.existing_cluster_id` will use an existing cluster. \n* Libraries specified by the field `.template.libraries` in `.deploy.yml` will be installed on the cluster. See https://docs.databricks.com/api/latest/libraries.html#library. \n **Note**: The field `.template.libraries.egg_main` is reserved for python `.egg` file that is versioned with the ETL job. \n For example when the main logic of the ETL job is put into a library. The `.egg` version number is expected to be the same as the ETL version number.\n* The main task notebook that will be executed by the job is defined by the field `.template.base_notebook` in `deploy.yml`. Task parameters are specified by the field `.params` in `configmap.yml` which will be accessible in the Notebooks via `dbutils`.\n* The notebook `main`, indicated by the field `.template.base_notebook` is the Task notebook. This notebook should be found in the workspace `/Shared/my-etl-job//main` specified by the fields `.metadata` and `.template.base_notebook` in `deploy.yml`. The version number `` will be specified in the CLI command.\n* Two ephemeral notebooks are available under `/Shared/my-etl-job//notebook_name1` and `/Shared/my-etl-job//notebook_name2`. This allows the main task to execute nested Notebooks, e.g.\n```\nnotebook_path_1 = dbutils.widgets.get(\"EPHEMERAL_NOTEBOOK_1\")\ndbutils.notebook.run(notebook_path_1)\n```\n \n \nCreate the Deployment by running the following command:\n\n```bash\n$ dbloy apply --deploy-yml deploy.yml --configmap-yml configmap.yml --version \n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/hjh17/dbloy", "keywords": "databricks cli ci/cd", "license": "['MIT']", "maintainer": "Hj\u00f6rtur Hjartarson", "maintainer_email": "hjorturh88@gmail.com", "name": "dbloy", "package_url": "https://pypi.org/project/dbloy/", "platform": "", "project_url": "https://pypi.org/project/dbloy/", "project_urls": { "Homepage": "https://github.com/hjh17/dbloy" }, "release_url": "https://pypi.org/project/dbloy/0.3.0/", "requires_dist": null, "requires_python": "", "summary": "Continuous Delivery tool for PySpark Notebooks based jobs on Databricks.", "version": "0.3.0" }, "last_serial": 5774523, "releases": { "0.3.0": [ { "comment_text": "", "digests": { "md5": "f6694e02bd5d6dec3eff08fc7c3577b6", "sha256": "b757ce138ab254fbeab17c83f4fee3d58bf974dfbb08173b11a3cfac769aafd9" }, "downloads": -1, "filename": "dbloy-0.3.0.tar.gz", "has_sig": false, "md5_digest": "f6694e02bd5d6dec3eff08fc7c3577b6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5501, "upload_time": "2019-08-30T13:35:28", "url": "https://files.pythonhosted.org/packages/36/5d/852215b182841bb3e96d9e62859a94ccc18ba54f91a4ff7546d6fc15da18/dbloy-0.3.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f6694e02bd5d6dec3eff08fc7c3577b6", "sha256": "b757ce138ab254fbeab17c83f4fee3d58bf974dfbb08173b11a3cfac769aafd9" }, "downloads": -1, "filename": "dbloy-0.3.0.tar.gz", "has_sig": false, "md5_digest": "f6694e02bd5d6dec3eff08fc7c3577b6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5501, "upload_time": "2019-08-30T13:35:28", "url": "https://files.pythonhosted.org/packages/36/5d/852215b182841bb3e96d9e62859a94ccc18ba54f91a4ff7546d6fc15da18/dbloy-0.3.0.tar.gz" } ] }