{ "info": { "author": "Steve Whalen", "author_email": "sjwhalen@yahoo.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# py-dataframe-show-reader\n\n``py-dataframe-show-reader`` is a library that reads the output of an Apache\nSpark DataFrame.show() statement into a PySpark DataFrame.\n\nThe primary intended use of the functions in this library is to enable\nwriting more concise and easy-to-read tests of data transformations than\nwould otherwise be possible.\n\nImagine we are working on a Python method that uses PySpark to perform a data\ntransformation that is not terribly complex, but complex enough that we would\nlike to verify that it performs as intended. For example, consider this\nfunction that accepts a PySpark DataFrame containing daily sales figures and\nreturns a DataFrame containing weekly sales summaries:\n\n```python\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql import functions as f\n\n\ndef summarize_weekly_sales(df_to_average: DataFrame):\n return (df_to_average\n .groupby(f.weekofyear('date').alias('week_of_year'))\n .agg(f.avg('units_sold').alias('avg_units_sold'),\n f.sum('gross_sales').alias('gross_sales')))\n```\n\nThis is not a complex method, but perhaps we would like to verify that:\n\n1. Dates are in fact grouped into different weeks.\n1. The units sold are actually averaged and not summed.\n1. Gross sales are actually summed and not averaged.\n\nA unit testing purist might argue that each of these assertions should be\ncovered by a separate test method, but there are at least two reasons why one\nmight choose not to go that route.\n\n1. Practical experience tells us that there is detectable overhead incurred for\neach separate PySpark transformation test, so we may want to limit the number\nof separate tests in order to keep our test suite as a whole running in a\nreasonable amount of time.\n\n1. 
When working with sets of data as we do in Spark or SQL, particularly when\nusing aggregate, grouping and window functions, there are sometimes\ninteractions between different rows that are easy to overlook. Tweaking a\nquery to fix an aggregate function like a summation might inadvertently break\nthe intended behavior of a window function in the same query. Such a change\nmight allow a summation-only unit test to pass while the broken window\nfunction behavior goes undetected, because we don't think to also update the\nwindow-function-only unit test.\n\nIf we accept that we'd like to use a single test to verify the three\nrequirements of our query listed above, we're going to need three rows in our\ninput DataFrame.\n\nUsing unadorned pytest, our test might look like this:\n\n```python\ndef test_without_dataframe_show_reader(spark_session: SparkSession):\n input_rows = [\n Row(\n date=datetime(2019, 1, 1),\n units_sold=10,\n gross_sales=100,\n ),\n Row(\n date=datetime(2019, 1, 2),\n units_sold=20,\n gross_sales=200,\n ),\n Row(\n date=datetime(2019, 1, 8),\n units_sold=80,\n gross_sales=800,\n ),\n ]\n input_df = spark_session.createDataFrame(input_rows)\n\n result = summarize_weekly_sales(input_df).collect()\n assert 2 == len(result)\n assert 1 == result[0]['week_of_year']\n assert 15 == result[0]['avg_units_sold']\n assert 300 == result[0]['gross_sales']\n```\n\nUsing the DataFrame show reader, our test could look like this instead:\n\n```python\ndef test_using_dataframe_show_reader(spark_session: SparkSession):\n input_df = show_output_to_df(\"\"\"\n +-------------------+----------+-----------+\n |date               |units_sold|gross_sales|\n +-------------------+----------+-----------+\n |2019-01-01 00:00:00|10        |100        |\n |2019-01-02 00:00:00|20        |200        |\n |2019-01-08 00:00:00|80        |800        |\n +-------------------+----------+-----------+\n \"\"\", spark_session)\n\n result = summarize_weekly_sales(input_df).collect()\n assert 2 == len(result)\n assert 1 == 
result[0]['week_of_year']\n assert 15 == result[0]['avg_units_sold']\n assert 300 == result[0]['gross_sales']\n```\n\nIn the second test example, the ``show_output_to_df`` function accepts as input\na string containing the output of a Spark DataFrame.show() call and\n\"rehydrates\" it into a new Spark DataFrame instance that can be used for\ntesting.\n\nIn the first version, the setup portion of the test contains eighteen lines,\nand it may take a few moments to digest the contents of the input rows.\nIn the second version, the setup portion contains nine lines and displays the\ninput data in a more concise tabular form that may be easier for other\nprogrammers (and our future selves) to digest when we need to maintain this\ncode down the road.\n\nIf the method under test were more complicated and required more rows and/or\ncolumns in order to adequately test, the length of the first test format would\ngrow much more quickly than that of the test using the DataFrame Show Reader.\n\n## Running the Tests\n\n1. Clone the git repo.\n1. ``cd`` into the root level of the repo.\n1. At a terminal command line, run ``pytest``\n\n## Who Maintains DataFrame Show Reader?\n\nDataFrame Show Reader is the work of the community. The core committers and\nmaintainers are responsible for reviewing and merging PRs, as well as steering\nconversation around new feature requests. 
If you would like to become a\nmaintainer, please contact us.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/internetsystemsgroup/py-dataframe-show-reader", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "py-dataframe-show-reader", "package_url": "https://pypi.org/project/py-dataframe-show-reader/", "platform": "", "project_url": "https://pypi.org/project/py-dataframe-show-reader/", "project_urls": { "Homepage": "https://github.com/internetsystemsgroup/py-dataframe-show-reader" }, "release_url": "https://pypi.org/project/py-dataframe-show-reader/1.0.0/", "requires_dist": [ "pyspark (>=2)" ], "requires_python": "", "summary": "Reads the output of a DataFrame.show() statement into a DataFrame", "version": "1.0.0" }, "last_serial": 5888767, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "f49397693f97c1709f10c794613662b2", "sha256": "071b0e83f826c0ea987c60c926a598359e1c3448cae57874298478b0695caeeb" }, "downloads": -1, "filename": "py_dataframe_show_reader-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "f49397693f97c1709f10c794613662b2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13079, "upload_time": "2019-09-26T06:08:41", "url": "https://files.pythonhosted.org/packages/a6/de/590863cd4b537a4457ff3f7d09749ae13440c01da6ae0e4b99bbbfec0ae5/py_dataframe_show_reader-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cc95949bc271af1e5a7a0b3d9f006d44", "sha256": "11df3151ad6983b6e0ac3e0a4191a220d03b3e76cae49f23a1223e221b3a2c27" }, "downloads": -1, "filename": "py-dataframe-show-reader-1.0.0.tar.gz", "has_sig": false, "md5_digest": "cc95949bc271af1e5a7a0b3d9f006d44", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8497, "upload_time": "2019-09-26T06:08:44", "url": 
"https://files.pythonhosted.org/packages/0f/38/61d7992fc853f16b25abdb7ca028ca197220383c91ed24bf4d443a19604b/py-dataframe-show-reader-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f49397693f97c1709f10c794613662b2", "sha256": "071b0e83f826c0ea987c60c926a598359e1c3448cae57874298478b0695caeeb" }, "downloads": -1, "filename": "py_dataframe_show_reader-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "f49397693f97c1709f10c794613662b2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13079, "upload_time": "2019-09-26T06:08:41", "url": "https://files.pythonhosted.org/packages/a6/de/590863cd4b537a4457ff3f7d09749ae13440c01da6ae0e4b99bbbfec0ae5/py_dataframe_show_reader-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cc95949bc271af1e5a7a0b3d9f006d44", "sha256": "11df3151ad6983b6e0ac3e0a4191a220d03b3e76cae49f23a1223e221b3a2c27" }, "downloads": -1, "filename": "py-dataframe-show-reader-1.0.0.tar.gz", "has_sig": false, "md5_digest": "cc95949bc271af1e5a7a0b3d9f006d44", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8497, "upload_time": "2019-09-26T06:08:44", "url": "https://files.pythonhosted.org/packages/0f/38/61d7992fc853f16b25abdb7ca028ca197220383c91ed24bf4d443a19604b/py-dataframe-show-reader-1.0.0.tar.gz" } ] }