{ "info": { "author": "Arm Treasure Data", "author_email": "dev+pypi@treasure-data.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "td_pyspark\n== \n\n[Treasure Data](https://treasuredata.com) extension for using [pyspark](https://spark.apache.org/docs/latest/api/python/index.html).\n\n\n```aidl\n$ pip install td-pyspark \n```\n\n# Introduction\n\nFirst contact support@treasure-data.com to enable td-spark feature. This feature is disabled by default.\n\ntd-pyspark is a library to enable Python to access tables in Treasure Data.\nThe features of td_pyspark include:\n- Reading tables in Treasure Data as DataFrame \n- Writing DataFrames to Treasure Data\n- Submitting Presto queries and read the query results as DataFrames\n\n\nAs of June 2019, Spark 2.4.x + Scala 2.11 is supported. Spark 2.4.3 is preferred.\n\nFor more details, see also [td-spark FAQ](https://support.treasuredata.com/hc/en-us/articles/360001487167-Apache-Spark-Driver-td-spark-FAQs).\n\n## Quick Start with Docker\n\nYou can try td_pyspark using Docker without installing Spark nor Python.\n\nFirst create __td-spark.conf__ file and set your TD API KEY and site (us, jp, eu01) configurations:\n\n__td-spark.conf__\n```\nspark.td.apikey (Your TD API KEY)\nspark.td.site (Your site: us, jp, eu01)\nspark.serializer org.apache.spark.serializer.KryoSerializer\nspark.sql.execution.arrow.enabled true\n```\n\nLaunch pyspark Docker image. This image already has a pre-installed td_pyspark library:\n```python\n$ docker run -it -e TD_SPARK_CONF=td-spark.conf -v $(pwd):/opt/spark/work armtd/td-spark-pyspark:latest\nPython 3.6.6 (default, Aug 24 2018, 05:04:18)\n[GCC 6.4.0] on linux\nType \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n19/06/13 19:33:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\nUsing Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\nWelcome to\n ____ __\n / __/__ ___ _____/ /__\n _\ \/ _ \/ _ `/ __/ '_/\n /__ / .__/\_,_/_/ /_/\_\ version 2.4.3\n /_/\n\nUsing Python version 3.6.6 (default, Aug 24 2018 05:04:18)\nSparkSession available as 'spark'.\n2019-06-13 19:33:49.449Z debug [spark] Loading com.treasuredata.spark package - (package.scala:23)\n2019-06-13 19:33:49.486Z info [spark] td-spark version:1.2.0+31-d0f3a15e, revision:d0f3a15, build_time:2019-06-13T10:33:43.655-0700 - (package.scala:24)\n2019-06-13 19:33:50.310Z info [TDServiceConfig] td-spark site: us - (TDServiceConfig.scala:36)\n2019-06-13 19:33:51.877Z info [LifeCycleManager] [session:7ebc16af] Starting a new lifecycle ... - (LifeCycleManager.scala:187)\n2019-06-13 19:33:51.880Z info [LifeCycleManager] [session:7ebc16af] ======== STARTED ======== - (LifeCycleManager.scala:191)\n>>> \n```\n\nTry reading a sample table by specifying a time range:\n```python\n>>> df = td.table(\"sample_datasets.www_access\").within(\"+2d/2014-10-04\").df()\n>>> df.show()\n2019-06-13 19:48:51.605Z info [TDRelation] Fetching the partition list of sample_datasets.www_access within time range:[2014-10-04 00:00:00Z,2014-10-06 00:00:00Z) - (TDRelation.scala:170)\n2019-06-13 19:48:51.950Z info [TDRelation] Retrieved 2 partition entries - (TDRelation.scala:176)\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\n|user| host| path| referer|code| agent|size|method| time|\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\n|null|192.225.229.196| /category/software| -| 200|Mozilla/5.0 (Maci...| 117| GET|1412382292|\n|null|120.168.215.131| /category/software| -| 200|Mozilla/5.0 
(comp...| 53| GET|1412382284|\n|null|180.198.173.136|/category/electro...| /category/computers| 200|Mozilla/5.0 (Wind...| 106| GET|1412382275|\n|null| 140.168.145.49| /item/garden/2832| /item/toys/230| 200|Mozilla/5.0 (Maci...| 122| GET|1412382267|\n|null| 52.168.78.222|/category/electro...| /item/games/2532| 200|Mozilla/5.0 (comp...| 73| GET|1412382259|\n|null| 32.42.160.165| /category/cameras|/category/cameras...| 200|Mozilla/5.0 (Wind...| 117| GET|1412382251|\n|null| 48.204.59.23| /category/software|/search/?c=Electr...| 200|Mozilla/5.0 (Maci...| 52| GET|1412382243|\n|null|136.207.150.227|/category/electro...| -| 200|Mozilla/5.0 (iPad...| 120| GET|1412382234|\n|null| 204.21.174.187| /category/jewelry| /item/office/3462| 200|Mozilla/5.0 (Wind...| 59| GET|1412382226|\n|null| 224.198.88.93| /category/office| /category/music| 200|Mozilla/4.0 (comp...| 46| GET|1412382218|\n|null| 96.54.24.116| /category/games| -| 200|Mozilla/5.0 (Wind...| 40| GET|1412382210|\n|null| 184.42.224.210| /category/computers| -| 200|Mozilla/5.0 (Wind...| 95| GET|1412382201|\n|null| 144.72.47.212|/item/giftcards/4684| /item/books/1031| 200|Mozilla/5.0 (Wind...| 65| GET|1412382193|\n|null| 40.213.111.170| /item/toys/1085| /category/cameras| 200|Mozilla/5.0 (Wind...| 65| GET|1412382185|\n|null| 132.54.226.209|/item/electronics...| /category/software| 200|Mozilla/5.0 (comp...| 121| GET|1412382177|\n|null| 108.219.68.64|/category/cameras...| -| 200|Mozilla/5.0 (Maci...| 54| GET|1412382168|\n|null| 168.66.149.218| /item/software/4343| /category/software| 200|Mozilla/4.0 (comp...| 139| GET|1412382160|\n|null| 80.66.118.103| /category/software| -| 200|Mozilla/4.0 (comp...| 92| GET|1412382152|\n|null|140.171.147.207| /category/music| /category/jewelry| 200|Mozilla/5.0 (Wind...| 119| GET|1412382144|\n|null| 84.132.164.204| /item/software/4783|/category/electro...| 200|Mozilla/5.0 (Wind...| 137| 
GET|1412382135|\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\nonly showing top 20 rows\n>>> \n```\n\n\n# Usage\n\n_TDSparkContext_ is the entry point for accessing td_pyspark's functionality. To create a TDSparkContext, pass your SparkSession (`spark`) to it:\n```python\nimport td_pyspark\n\ntd = td_pyspark.TDSparkContext(spark)\n```\n\n## Reading Tables as DataFrames\n\nTo read a table, use `td.table(table name)`:\n```python\ndf = td.table(\"sample_datasets.www_access\").df()\ndf.show()\n```\n\nTo change the context database, use `td.use(database_name)`:\n```python\ntd.use(\"sample_datasets\")\n# Accesses sample_datasets.www_access\ndf = td.table(\"www_access\").df()\n```\n\nCalling `.df()` reads your table data as a Spark DataFrame.\nThe resulting DataFrame is used in the same way as any other PySpark DataFrame. See also [PySpark DataFrame documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame).\n\n### Specifying Time Ranges\n\nTreasure Data is a time series database, so reading recent data by specifying a time range is important to reduce the amount of data to be processed.\nThe `.within(...)` function can be used to specify a target time range in a concise syntax.\nThe `within` function accepts the same syntax used in [TD_INTERVAL function in Presto](https://support.treasuredata.com/hc/en-us/articles/360001450828-Supported-Presto-and-TD-Functions#TD_INTERVAL).\n\nFor example, to read the last 1-hour range of data, use `within(\"-1h\")`:\n```python\ntd.table(\"tbl\").within(\"-1h\").df()\n```\n\nYou can also read the last day's data:\n```python\ntd.table(\"tbl\").within(\"-1d\").df()\n```\n\nYou can also specify an _offset_ of the relative time range. 
This example reads the last day's data, beginning from 7 days ago:\n```python\ntd.table(\"tbl\").within(\"-1d/-7d\").df()\n```\n\nIf you know an exact time range, `within(\"(start time)/(end time)\")` is useful:\n```python\n>>> df = td.table(\"sample_datasets.www_access\").within(\"2014-10-04/2014-10-05\").df()\n>>> df.show()\n2019-06-13 20:12:01.400Z info [TDRelation] Fetching the partition list of sample_datasets.www_access within time range:[2014-10-04 00:00:00Z,2014-10-05 00:00:00Z) - (TDRelation.scala:170)\n...\n```\n\nSee [this doc](https://support.treasuredata.com/hc/en-us/articles/360001450828-Supported-Presto-and-TD-Functions#TD_INTERVAL) for more examples of interval strings.\n\n\n### Submitting Presto Queries\n\nIf your Spark cluster is small, reading all of the data as an in-memory DataFrame might be difficult.\nIn this case, you can utilize Presto, a distributed SQL query engine, to reduce the amount of data processed by PySpark:\n\n```python\n>>> q = td.presto(\"select code, count(*) cnt from sample_datasets.www_access group by 1\")\n>>> q.show()\n2019-06-13 20:09:13.245Z info [TDPrestoJDBCRDD] - (TDPrestoRelation.scala:106)\nSubmit Presto query:\nselect code, count(*) cnt from sample_datasets.www_access group by 1\n+----+----+\n|code| cnt|\n+----+----+\n| 200|4981|\n| 500| 2|\n| 404| 17|\n+----+----+\n```\n\nThe query result is represented as a DataFrame.\n\nTo run non-query statements (e.g., INSERT INTO, CREATE TABLE, etc.) 
use `execute_presto(sql)`:\n```python\ntd.execute_presto(\"CREATE TABLE IF NOT EXISTS A(time bigint, id varchar)\")\n```\n\n### Using SparkSQL\n\nTo use tables in Treasure Data inside Spark SQL, create a view with `df.createOrReplaceTempView(...)`:\n```python\n# Read TD table as a DataFrame\ndf = td.table(\"mydb.test1\").df()\n# Register the DataFrame as a view\ndf.createOrReplaceTempView(\"test1\")\n\nspark.sql(\"SELECT * FROM test1\").show()\n```\n\n## Create or Drop Databases and Tables\n\nCreate a new database or table:\n```python\ntd.create_database_if_not_exists(\"mydb\")\ntd.create_table_if_not_exists(\"mydb.test1\")\n```\n\nDelete unnecessary tables:\n```python\ntd.drop_table_if_exists(\"mydb.test1\")\ntd.drop_database_if_exists(\"mydb\")\n```\n\nYou can also check the presence of a table:\n```python\ntd.table(\"mydb.test1\").exists() # True if the table exists\n```\n\n## Create User-Defined Partition Tables\n\nUser-defined partitioning ([UDP](https://support.treasuredata.com/hc/en-us/articles/360009798714-User-Defined-Partitioning-for-Presto)) is useful if\nyou know a column in the table that has unique identifiers (e.g., IDs, category values).\n\nYou can create a UDP table partitioned by `id` (a string type column) as follows:\n```python\ntd.create_udp_s(\"mydb.user_list\", \"id\")\n```\n\nTo create a UDP table partitioned by a Long (bigint) type column, use `td.create_udp_l`:\n```python\ntd.create_udp_l(\"mydb.departments\", \"dept_id\")\n```\n\n## Uploading DataFrames to Treasure Data\n\nTo save your local DataFrames as a table, `td.insert_into(df, table)` and `td.create_or_replace(df, table)` can be used:\n\n```python\n# Insert the records in the input DataFrame to the target table:\ntd.insert_into(df, \"mydb.tbl1\")\n\n# Create or replace the target table with the content of the input DataFrame:\ntd.create_or_replace(df, \"mydb.tbl2\")\n```\n\n## Running PySpark jobs with spark-submit\n\nTo submit your PySpark script to a Spark cluster, you will need the 
following files:\n- __td-spark.conf__ file that describes your TD API key and `spark.td.site` (see above).\n- __td_pyspark.py__\n - Check the file location using `pip show -f td-pyspark`, and copy td_pyspark.py to your favorite location\n- __td-spark-assembly.jar__\n - Get the latest version from the [Download](https://support.treasuredata.com/hc/en-us/articles/360000716627-TD-Spark-Driver-td-spark-) page.\n- Pre-built Spark\n - [Download Spark 2.4.x](https://spark.apache.org/downloads.html) with Hadoop 2.7.x (built for Scala 2.11)\n - Extract the downloaded archive. This folder location will be your `$SPARK_HOME`.\n\n\nHere is an example PySpark application:\n__my_app.py__\n```python\nimport td_pyspark\nfrom pyspark.sql import SparkSession\n\n# Create a new SparkSession\nspark = SparkSession\\\n .builder\\\n .appName(\"myapp\")\\\n .getOrCreate()\n\n# Create TDSparkContext\ntd = td_pyspark.TDSparkContext(spark)\n\n# Read the table data within -1d (yesterday) range as DataFrame\ndf = td.table(\"sample_datasets.www_access\").within(\"-1d\").df()\ndf.show()\n```\n\nTo run `my_app.py`, use spark-submit, specifying the necessary files mentioned above:\n```bash\n# Launching PySpark with the local mode\n$ ${SPARK_HOME}/bin/spark-submit --master \"local[4]\"\\\n --driver-class-path td-spark-assembly.jar\\\n --properties-file=td-spark.conf\\\n --py-files td_pyspark.py\\\n my_app.py\n```\n\n`local[4]` means running a Spark cluster locally using 4 threads.\n\nTo use a remote Spark cluster, specify the `master` address, e.g., `--master=spark://(master node IP address):7077`.\n\n## Using the td-spark assembly included in the PyPI package\n\nThe package contains a pre-built binary of td-spark so that you can add it to the classpath by default.\n`TDSparkContextBuilder.default_jar_path()` returns the path to the default td-spark-assembly.jar file.\nPassing this path to the `jars` method of TDSparkContextBuilder will automatically build the SparkSession including the default 
jar.\n\n\n```python\nimport td_pyspark\nfrom pyspark.sql import SparkSession\n\nbuilder = SparkSession\\\n .builder\\\n .appName(\"td-pyspark-app\")\n\ntd = td_pyspark.TDSparkContextBuilder(builder)\\\n .apikey(\"XXXXXXXXXXXXXX\")\\\n .jars(td_pyspark.TDSparkContextBuilder.default_jar_path())\\\n .build()\n```\n\n# For Developers\n\nRunning pyspark with td_pyspark:\n\n```bash\n$ ${SPARK_HOME}/bin/spark-submit --master \"local[4]\" --driver-class-path td-spark-assembly.jar --properties-file=td-spark.conf --py-files td_pyspark.py your_app.py\n```\n\n## How To Publish td_pyspark\n\n### Prerequisites\n\n[Twine](https://pypi.org/project/twine/) is a secure utility for publishing Python packages. It's commonly used to publish packages to PyPI.\nFirst, install it:\n\n```bash\n$ pip install twine\n```\n\nHaving a configuration file with your PyPI credentials may be useful.\n\n```\n$ cat << 'EOF' > ~/.pypirc\n[distutils]\nindex-servers =\n pypi\n pypitest\n\n[pypi]\nrepository=https://upload.pypi.org/legacy/\nusername=\npassword=\n\n[pypitest]\nrepository=https://test.pypi.org/legacy/\nusername=\npassword=\nEOF\n```\n\n### Build Package\n\nBuild the package in both source and wheel formats.\n\n```\n$ make package\n```\n\n### Publish Package\n\nUpload the package to the test repository first.\n\n```\n$ twine upload \\\n --repository pypitest \\\n dist/*\n```\n\nIf you do not find anything wrong in the test repository, then it's time to publish the package.\n\n```\n$ twine upload \\\n --repository pypi \\\n dist/*\n```\n\n\n`TDSparkContextBuilder` automatically sets site-specific information.\n\n\n## Customize API endpoints\n\nUse `TDSparkContextBuilder` to specify different API endpoints (e.g., a development API):\n```python\nimport td_pyspark\nfrom pyspark.sql import SparkSession\n\nbuilder = SparkSession\\\n .builder\\\n .appName(\"td-pyspark-app\")\n\ntd = td_pyspark.TDSparkContextBuilder(builder)\\\n .apikey(\"XXXXXXXXXXXXXX\")\\\n 
.api_endpoint(\"api.treasuredata.com\")\\\n .build()\n\n# Read the table data within -1d (yesterday) range as DataFrame\ndf = td.table(\"sample_datasets.www_access\")\\\n .within(\"-1d\")\\\n .df()\n\ndf.show()\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://support.treasuredata.com/hc/en-us/sections/360000317028-Data-Science-and-SQL-Tools", "keywords": "Spark PySpark TreasureData", "license": "Apache 2", "maintainer": "", "maintainer_email": "", "name": "td-pyspark", "package_url": "https://pypi.org/project/td-pyspark/", "platform": "", "project_url": "https://pypi.org/project/td-pyspark/", "project_urls": { "Homepage": "https://support.treasuredata.com/hc/en-us/sections/360000317028-Data-Science-and-SQL-Tools" }, "release_url": "https://pypi.org/project/td-pyspark/19.9.0/", "requires_dist": [ "pyspark (>=2.4.0) ; extra == 'spark'" ], "requires_python": "", "summary": "Treasure Data extension for pyspark", "version": "19.9.0" }, "last_serial": 5817812, "releases": { "19.5.0": [ { "comment_text": "", "digests": { "md5": "3dada29e9e97c137c5b15a0adfba0d6d", "sha256": "71ea79f6a84ae7cc32ba5923694366416254a87789b97e5c63cbe6b49aea0847" }, "downloads": -1, "filename": "td_pyspark-19.5.0-py3-none-any.whl", "has_sig": false, "md5_digest": "3dada29e9e97c137c5b15a0adfba0d6d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 3474, "upload_time": "2019-06-10T04:40:01", "url": "https://files.pythonhosted.org/packages/44/f6/7781921289bbb439c9784acd5af7098f1699b99b6a5a885e8357c2b405a1/td_pyspark-19.5.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1adec893a40645892f47687901c195f3", "sha256": "85c7b71746f600b21b62905625ff2808fe1b2dabf1c89fda29a7953c0f79e8b8" }, "downloads": -1, "filename": "td_pyspark-19.5.0.tar.gz", "has_sig": false, "md5_digest": "1adec893a40645892f47687901c195f3", 
"packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3229, "upload_time": "2019-06-10T04:40:03", "url": "https://files.pythonhosted.org/packages/1e/ec/566cec7a476faa73812d592af32c3e72c9a272ae2fc5492bd4717876eeca/td_pyspark-19.5.0.tar.gz" } ], "19.7.0": [ { "comment_text": "", "digests": { "md5": "ba185616d590dc06aadd9758ce72cfae", "sha256": "f88b10f7b7bab8fc6d0afa0ba42ae6dc1081d631faaba683b53a6a255088d215" }, "downloads": -1, "filename": "td_pyspark-19.7.0-py3-none-any.whl", "has_sig": false, "md5_digest": "ba185616d590dc06aadd9758ce72cfae", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 45597589, "upload_time": "2019-07-02T02:59:21", "url": "https://files.pythonhosted.org/packages/85/c0/1257fac3ed8b1375554ecc2151e5efe73095693248630f140916bc66895e/td_pyspark-19.7.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e4710537c1503051c42396cf27a49279", "sha256": "629da3e4e6c03d9ca33c42c0bb0cdaf13d56693ee41d643211e9bfb98a1d7f2b" }, "downloads": -1, "filename": "td_pyspark-19.7.0.tar.gz", "has_sig": false, "md5_digest": "e4710537c1503051c42396cf27a49279", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22806668, "upload_time": "2019-07-02T02:59:34", "url": "https://files.pythonhosted.org/packages/12/39/b66e20b65462fd702ef6d28d2c2248820c4de09e387187aa27f3f0d9328a/td_pyspark-19.7.0.tar.gz" } ], "19.9.0": [ { "comment_text": "", "digests": { "md5": "653929feccce51e5299258f71219f689", "sha256": "3bf6be6d89ea3284d21d4a24ae8e97e44bfbad989b459b046793e4106e2c1cd4" }, "downloads": -1, "filename": "td_pyspark-19.9.0-py3-none-any.whl", "has_sig": false, "md5_digest": "653929feccce51e5299258f71219f689", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 45820482, "upload_time": "2019-09-12T02:08:48", "url": 
"https://files.pythonhosted.org/packages/aa/6e/3288fe1409f71204ee3b96382a4bcca900912dc77a5a0cef8b55fcab23ea/td_pyspark-19.9.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cb042fa5e86d69452f5838d860b6b848", "sha256": "14d7dd49ecdb818d1579c63812729d5eed7393da50456004b3560061882fb795" }, "downloads": -1, "filename": "td_pyspark-19.9.0.tar.gz", "has_sig": false, "md5_digest": "cb042fa5e86d69452f5838d860b6b848", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23025434, "upload_time": "2019-09-12T02:08:55", "url": "https://files.pythonhosted.org/packages/f4/fc/4e0149a6c44d58fbbd0c138f025c2966f209dd3bf4bd61443f1d3f3d3e28/td_pyspark-19.9.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "653929feccce51e5299258f71219f689", "sha256": "3bf6be6d89ea3284d21d4a24ae8e97e44bfbad989b459b046793e4106e2c1cd4" }, "downloads": -1, "filename": "td_pyspark-19.9.0-py3-none-any.whl", "has_sig": false, "md5_digest": "653929feccce51e5299258f71219f689", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 45820482, "upload_time": "2019-09-12T02:08:48", "url": "https://files.pythonhosted.org/packages/aa/6e/3288fe1409f71204ee3b96382a4bcca900912dc77a5a0cef8b55fcab23ea/td_pyspark-19.9.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cb042fa5e86d69452f5838d860b6b848", "sha256": "14d7dd49ecdb818d1579c63812729d5eed7393da50456004b3560061882fb795" }, "downloads": -1, "filename": "td_pyspark-19.9.0.tar.gz", "has_sig": false, "md5_digest": "cb042fa5e86d69452f5838d860b6b848", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23025434, "upload_time": "2019-09-12T02:08:55", "url": "https://files.pythonhosted.org/packages/f4/fc/4e0149a6c44d58fbbd0c138f025c2966f209dd3bf4bd61443f1d3f3d3e28/td_pyspark-19.9.0.tar.gz" } ] }