{ "info": { "author": "Fishtown Analytics", "author_email": "info@fishtownanalytics.com", "bugtrack_url": null, "classifiers": [], "description": "## dbt-spark\n\n### Documentation\nFor more information on using Spark with dbt, consult the [dbt documentation](https://docs.getdbt.com/docs/profile-spark).\n\n### Installation\nThis plugin can be installed via pip:\n```\n$ pip install dbt-spark\n```\n\n### Configuring your profile\n\n**Connection Method**\n\nConnections can be made to Spark in two different modes. The `http` mode is used when connecting to a managed service such as Databricks, which provides an HTTP endpoint; the `thrift` mode is used to connect directly to the master node of a cluster (either on-premise or in the cloud).\n\nA dbt profile can be configured to run against Spark using the following configuration:\n\n| Option | Description | Required? | Example |\n|---------|----------------------------------------------------|-------------------------|--------------------------|\n| method | Specify the connection method (`thrift` or `http`) | Required | `http` |\n| schema | Specify the schema (database) to build models into | Required | `analytics` |\n| host | The hostname to connect to | Required | `yourorg.sparkhost.com` |\n| port | The port to connect to the host on | Optional (default: 443 for `http`, 10001 for `thrift`) | `443` |\n| token | The token to use for authenticating to the cluster | Required for `http` | `abc123` |\n| cluster | The name of the cluster to connect to | Required for `http` | `01234-23423-coffeetime` |\n| user | The username to use to connect to the cluster | Optional | `hadoop` |\n| connect_timeout | The number of seconds to wait before retrying to connect to a Pending Spark cluster | Optional (default: 10) | `60` |\n| connect_retries | The number of times to try connecting to a Pending Spark cluster before giving up | Optional (default: 0) | `5` |\n\n**Usage with Amazon EMR**\n\nTo connect to Spark running on an Amazon EMR 
cluster, you will need to run `sudo /usr/lib/spark/sbin/start-thriftserver.sh` on the master node of the cluster to start the Thrift server (see https://aws.amazon.com/premiumsupport/knowledge-center/jdbc-connection-emr/ for further context). You will also need to connect to port `10001`, which connects to the Spark backend Thrift server; port `10000` will instead connect to a Hive backend, which will not work correctly with dbt.\n\n**Example profiles.yml entries:**\n```\nyour_profile_name:\n  target: dev\n  outputs:\n    dev:\n      method: http\n      type: spark\n      schema: analytics\n      host: yourorg.sparkhost.com\n      port: 443\n      token: abc123\n      cluster: 01234-23423-coffeetime\n      connect_retries: 5\n      connect_timeout: 60\n```\n\n```\nyour_profile_name:\n  target: dev\n  outputs:\n    dev:\n      method: thrift\n      type: spark\n      schema: analytics\n      host: 127.0.0.1\n      port: 10001\n      user: hadoop\n      connect_retries: 5\n      connect_timeout: 60\n```\n\n### Usage Notes\n\n**Model Configuration**\n\nThe following configurations can be supplied to models run with the dbt-spark plugin:\n\n| Option | Description | Required? | Example |\n|---------|----------------------------------------------------|-------------------------|--------------------------|\n| file_format | The file format to use when creating tables | Optional | `parquet` |\n\n**Incremental Models**\n\nSpark does not natively support `delete`, `update`, or `merge` statements. As such, [incremental models](https://docs.getdbt.com/docs/configuring-incremental-models) are implemented differently in this plugin than in other dbt adapters. To use incremental models, specify a `partition_by` clause in your model config. dbt will use an `insert overwrite` query to overwrite the partitions included in your query. 
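\n\nFor illustration, an incremental run of a model like the example below results in Spark SQL along these lines (the table name and exact statement shape are illustrative, not the literal SQL this plugin emits):\n\n```\ninsert overwrite table analytics.users_by_day\npartition (date_day)\nselect date_day, count(*) as users\nfrom analytics.events\ngroup by 1\n```\n\nBecause the statement overwrites every partition returned by the select, a partition that is only partially re-selected will lose its remaining rows.\n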
Be sure to re-select _all_ of the relevant data for a partition when using incremental models.\n\n```\n{{ config(\n    materialized='incremental',\n    partition_by=['date_day'],\n    file_format='parquet'\n) }}\n\n/*\n    Every partition returned by this query will be overwritten\n    when this model runs\n*/\n\nselect\n    date_day,\n    count(*) as users\n\nfrom {{ ref('events') }}\nwhere date_day::date >= '2019-01-01'\ngroup by 1\n```\n\n### Reporting bugs and contributing code\n\n- Want to report a bug or request a feature? Let us know on [Slack](http://slack.getdbt.com/), or open [an issue](https://github.com/fishtown-analytics/dbt-spark/issues/new).\n\n## Code of Conduct\n\nEveryone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the [PyPA Code of Conduct](https://www.pypa.io/en/latest/code-of-conduct/).", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/fishtown-analytics/dbt-spark", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "dbt-spark", "package_url": "https://pypi.org/project/dbt-spark/", "platform": "", "project_url": "https://pypi.org/project/dbt-spark/", "project_urls": { "Homepage": "https://github.com/fishtown-analytics/dbt-spark" }, "release_url": "https://pypi.org/project/dbt-spark/0.13.0/", "requires_dist": null, "requires_python": "", "summary": "The SparkSQL plugin for dbt (data build tool)", "version": "0.13.0" }, "last_serial": 5482736, "releases": { "0.13.0": [ { "comment_text": "", "digests": { "md5": "b4985cce5174703043df23a701f7cce3", "sha256": "65d8d9ccfd5185cfaba1652bb732d69e25eda12dbafcdb67943615d3255e6242" }, "downloads": -1, "filename": "dbt_spark-0.13.0-py3.7.egg", "has_sig": false, "md5_digest": "b4985cce5174703043df23a701f7cce3", "packagetype": "bdist_egg", "python_version": "3.7", "requires_python": null, "size": 
26236, "upload_time": "2019-07-03T17:12:07", "url": "https://files.pythonhosted.org/packages/6a/79/686f13b7bfa55ff80abc40c3db0a61f59fafba6c17e9b8fcebb153eed6bf/dbt_spark-0.13.0-py3.7.egg" }, { "comment_text": "", "digests": { "md5": "aba7d7199a6f4f76fcc8c0933cbc5a4d", "sha256": "d0c3255edadec5a2d423ca7fd20a4d2b0ba45c75fc0b73b554121a98f74c72c6" }, "downloads": -1, "filename": "dbt-spark-0.13.0.tar.gz", "has_sig": false, "md5_digest": "aba7d7199a6f4f76fcc8c0933cbc5a4d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12799, "upload_time": "2019-07-03T17:12:04", "url": "https://files.pythonhosted.org/packages/bb/37/fe34166ef27c5d71022ae27ec2445c8c0227b3f17bd5999e5893e6012ca8/dbt-spark-0.13.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b4985cce5174703043df23a701f7cce3", "sha256": "65d8d9ccfd5185cfaba1652bb732d69e25eda12dbafcdb67943615d3255e6242" }, "downloads": -1, "filename": "dbt_spark-0.13.0-py3.7.egg", "has_sig": false, "md5_digest": "b4985cce5174703043df23a701f7cce3", "packagetype": "bdist_egg", "python_version": "3.7", "requires_python": null, "size": 26236, "upload_time": "2019-07-03T17:12:07", "url": "https://files.pythonhosted.org/packages/6a/79/686f13b7bfa55ff80abc40c3db0a61f59fafba6c17e9b8fcebb153eed6bf/dbt_spark-0.13.0-py3.7.egg" }, { "comment_text": "", "digests": { "md5": "aba7d7199a6f4f76fcc8c0933cbc5a4d", "sha256": "d0c3255edadec5a2d423ca7fd20a4d2b0ba45c75fc0b73b554121a98f74c72c6" }, "downloads": -1, "filename": "dbt-spark-0.13.0.tar.gz", "has_sig": false, "md5_digest": "aba7d7199a6f4f76fcc8c0933cbc5a4d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12799, "upload_time": "2019-07-03T17:12:04", "url": "https://files.pythonhosted.org/packages/bb/37/fe34166ef27c5d71022ae27ec2445c8c0227b3f17bd5999e5893e6012ca8/dbt-spark-0.13.0.tar.gz" } ] }