{ "info": { "author": "", "author_email": "", "bugtrack_url": null, "classifiers": [], "description": ".. image:: https://travis-ci.org/capitalone/datacompy.svg?branch=master\n :target: https://travis-ci.org/capitalone/datacompy\n.. image:: https://img.shields.io/badge/code%20style-black-000000.svg\n :target: https://github.com/ambv/black\n\n=========\nDataComPy\n=========\n\nDataComPy is a package to compare two Pandas DataFrames. Originally started to\nbe something of a replacement for SAS's ``PROC COMPARE`` for Pandas DataFrames\nwith some more functionality than just ``Pandas.DataFrame.equals(Pandas.DataFrame)``\n(in that it prints out some stats, and lets you tweak how accurate matches have to be).\nThen extended to carry that functionality over to Spark Dataframes.\n\nQuick Installation\n==================\n\n::\n\n pip install datacompy\n\nPandas Detail\n=============\n\nDataComPy will try to join two dataframes either on a list of join columns, or\non indexes. If the two dataframes have duplicates based on join values, the\nmatch process sorts by the remaining fields and joins based on that row number.\n\nColumn-wise comparisons attempt to match values even when dtypes don't match.\nSo if, for example, you have a column with ``decimal.Decimal`` values in one\ndataframe and an identically-named column with ``float64`` dtype in another,\nit will tell you that the dtypes are different but will still try to compare the\nvalues.\n\nBasic Usage\n-----------\n\n.. 
code-block:: python\n\n from io import StringIO\n import pandas as pd\n import datacompy\n\n data1 = \"\"\"acct_id,dollar_amt,name,float_fld,date_fld\n 10000001234,123.45,George Maharis,14530.1555,2017-01-01\n 10000001235,0.45,Michael Bluth,1,2017-01-01\n 10000001236,1345,George Bluth,,2017-01-01\n 10000001237,123456,Bob Loblaw,345.12,2017-01-01\n 10000001239,1.05,Lucille Bluth,,2017-01-01\n \"\"\"\n\n data2 = \"\"\"acct_id,dollar_amt,name,float_fld\n 10000001234,123.4,George Michael Bluth,14530.155\n 10000001235,0.45,Michael Bluth,\n 10000001236,1345,George Bluth,1\n 10000001237,123456,Robert Loblaw,345.12\n 10000001238,1.05,Loose Seal Bluth,111\n \"\"\"\n\n df1 = pd.read_csv(StringIO(data1))\n df2 = pd.read_csv(StringIO(data2))\n\n compare = datacompy.Compare(\n df1,\n df2,\n join_columns='acct_id', #You can also specify a list of columns\n abs_tol=0, #Optional, defaults to 0\n rel_tol=0, #Optional, defaults to 0\n df1_name='Original', #Optional, defaults to 'df1'\n df2_name='New' #Optional, defaults to 'df2'\n )\n compare.matches(ignore_extra_columns=False)\n # False\n\n # This method prints out a human-readable report summarizing and sampling differences\n print(compare.report())\n\nSee docs for more detailed usage instructions and an example of the report output.\n\nThings that are happening behind the scenes\n-------------------------------------------\n\n- You pass in two dataframes (``df1``, ``df2``) to ``datacompy.Compare`` and a\n column to join on (or list of columns) to ``join_columns``. By default the\n comparison needs to match values exactly, but you can pass in ``abs_tol``\n and/or ``rel_tol`` to apply absolute and/or relative tolerances for numeric columns.\n\n - You can pass in ``on_index=True`` instead of ``join_columns`` to join on\n the index instead.\n\n- The class validates that you passed dataframes, that they contain all of the\n columns in `join_columns` and have unique column names other than that. 
The\n class also lowercases all column names to disambiguate.\n- On initialization the class validates inputs, and runs the comparison.\n- ``Compare.matches()`` will return ``True`` if the dataframes match, ``False``\n otherwise.\n\n - You can pass in ``ignore_extra_columns=True`` to not return ``False`` just\n because there are non-overlapping column names (will still check on\n overlapping columns)\n - NOTE: if you only want to validate whether a dataframe matches exactly or\n not, you should look at ``pandas.testing.assert_frame_equal``. The main\n use case for ``datacompy`` is when you need to interpret the difference\n between two dataframes.\n\n- Compare also has some shortcuts like\n\n - ``intersect_rows``, ``df1_unq_rows``, ``df2_unq_rows`` for getting\n intersection, just df1 and just df2 records (DataFrames)\n - ``intersect_columns()``, ``df1_unq_columns()``, ``df2_unq_columns()`` for\n getting intersection, just df1 and just df2 columns (Sets)\n\n- You can turn on logging to see more detailed logs.\n\n\n.. _spark-detail:\n\nSpark Detail\n============\n\nDataComPy's ``SparkCompare`` class will join two dataframes on a list of join\ncolumns. It has the capability to map column names that may be different in each\ndataframe, including in the join columns. You are responsible for creating the\ndataframes from any source which Spark can handle and specifying a unique join\nkey. If there are duplicates in either dataframe by join key, the match process\nwill remove the duplicates before joining (and tell you how many duplicates were\nfound).\n\nAs with the Pandas-based ``Compare`` class, comparisons will be attempted even\nif dtypes don't match. 
Any schema differences will be reported in the output\nas well as in any mismatch reports, so that you can assess whether a\ntype mismatch is a problem.\n\nThe main reasons why you would choose to use ``SparkCompare`` over ``Compare``\nare that your data is too large to fit into memory, or you're comparing data\nthat works well in a Spark environment, like partitioned Parquet, CSV, or JSON\nfiles, or Cerebro tables.\n\nPerformance Implications\n------------------------\n\nSpark scales incredibly well, so you can use ``SparkCompare`` to compare\nbillions of rows of data, provided you spin up a big enough cluster. Still,\njoining billions of rows of data is an inherently large task, so there are a\ncouple of things you may want to take into consideration when getting into the\ncliched realm of \"big data\":\n\n* ``SparkCompare`` will compare all columns in common in the dataframes and\n report on the rest. If there are columns in the data that you don't care to\n compare, use a ``select`` statement/method on the dataframe(s) to filter\n those out. Particularly when reading from wide Parquet files, this can make\n a huge difference when the columns you don't care about don't have to be\n read into memory and included in the joined dataframe.\n* For large datasets, adding ``cache_intermediates=True`` to the ``SparkCompare``\n call can help optimize performance by caching certain intermediate dataframes\n in memory, like the de-duped version of each input dataset, or the joined\n dataframe. Otherwise, Spark's lazy evaluation will recompute those each time\n it needs the data in a report or as you access instance attributes. This may\n be fine for smaller dataframes, but will be costly for larger ones. You do\n need to ensure that you have enough free cache memory before you do this, so\n this parameter is set to False by default.\n\nBasic Usage\n-----------\n\n.. 
code-block:: python\n\n import datetime\n import datacompy\n from pyspark.sql import Row\n\n # This example assumes you have a SparkSession named \"spark\" in your environment, as you\n # do when running `pyspark` from the terminal or in a Databricks notebook (Spark v2.0 and higher)\n\n data1 = [\n Row(acct_id=10000001234, dollar_amt=123.45, name='George Maharis', float_fld=14530.1555,\n date_fld=datetime.date(2017, 1, 1)),\n Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=1.0,\n date_fld=datetime.date(2017, 1, 1)),\n Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=None,\n date_fld=datetime.date(2017, 1, 1)),\n Row(acct_id=10000001237, dollar_amt=123456.0, name='Bob Loblaw', float_fld=345.12,\n date_fld=datetime.date(2017, 1, 1)),\n Row(acct_id=10000001239, dollar_amt=1.05, name='Lucille Bluth', float_fld=None,\n date_fld=datetime.date(2017, 1, 1))\n ]\n\n data2 = [\n Row(acct_id=10000001234, dollar_amt=123.4, name='George Michael Bluth', float_fld=14530.155),\n Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=None),\n Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=1.0),\n Row(acct_id=10000001237, dollar_amt=123456.0, name='Robert Loblaw', float_fld=345.12),\n Row(acct_id=10000001238, dollar_amt=1.05, name='Loose Seal Bluth', float_fld=111.0)\n ]\n\n base_df = spark.createDataFrame(data1)\n compare_df = spark.createDataFrame(data2)\n\n comparison = datacompy.SparkCompare(spark, base_df, compare_df, join_columns=['acct_id'])\n\n # This prints out a human-readable report summarizing differences\n comparison.report()\n\nUsing SparkCompare on EMR or standalone Spark\n---------------------------------------------\n\n1. Set proxy variables\n2. Create a virtual environment, if desired (``virtualenv venv; source venv/bin/activate``)\n3. Pip install datacompy and requirements\n4. 
Ensure your SPARK_HOME environment variable is set (this is probably ``/usr/lib/spark`` but may\n differ based on your installation)\n5. Augment your PYTHONPATH environment variable with\n ``export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$SPARK_HOME/python:$PYTHONPATH``\n (note that your version of py4j may differ depending on the version of Spark you're using)\n\n\nUsing SparkCompare on Databricks\n--------------------------------\n\n1. Clone this repository locally\n2. Create a datacompy egg by running ``python setup.py bdist_egg`` from the repo root directory.\n3. From the Databricks front page, click the \"Library\" link under the \"New\" section.\n4. On the New library page:\n a. Change source to \"Upload Python Egg or PyPi\"\n b. Under \"Upload Egg\", Library Name should be \"datacompy\"\n c. Drag the egg file in datacompy/dist/ to the \"Drop library egg here to upload\" box\n d. Click the \"Create Library\" button\n5. Once the library has been created, from the library page (which you can find in your /Users/{login} workspace),\n you can choose clusters to attach the library to.\n6. ``import datacompy`` in a notebook attached to the cluster that the library is attached to and enjoy!\n\nContributors\n------------\n\nWe welcome your interest in Capital One\u2019s Open Source Projects (the \"Project\").\nAny Contributor to the project must accept and sign a CLA indicating agreement to\nthe license terms. 
Except for the license granted in this CLA to Capital One and\nto recipients of software distributed by Capital One, you reserve all right, title,\nand interest in and to your contributions; this CLA does not impact your rights to\nuse your own contributions for any other purpose.\n\n- `Link to Individual CLA `_\n- `Link to Corporate CLA `_\n\nThis project adheres to the `Open Source Code of Conduct `_.\nBy participating, you are expected to honor this code.\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/capitalone/datacompy", "keywords": "", "license": "Apache-2.0", "maintainer": "", "maintainer_email": "", "name": "datacompy", "package_url": "https://pypi.org/project/datacompy/", "platform": "", "project_url": "https://pypi.org/project/datacompy/", "project_urls": { "Homepage": "https://github.com/capitalone/datacompy" }, "release_url": "https://pypi.org/project/datacompy/0.6.0/", "requires_dist": [ "pandas (>=0.19.0)", "numpy (>=1.11.3)", "six (>=1.10)", "enum34 (>=1.1.6) ; python_version < \"3.4\"" ], "requires_python": "", "summary": "Dataframe comparison in Python", "version": "0.6.0" }, "last_serial": 5317571, "releases": { "0.5.0": [ { "comment_text": "", "digests": { "md5": "1495948190c55bbab700a3c301efd4e0", "sha256": "20092aaffe3fe0ea0d5dffd37674e2873be44d3ac49ce278fb9622b521d1132c" }, "downloads": -1, "filename": "datacompy-0.5.0.tar.gz", "has_sig": false, "md5_digest": "1495948190c55bbab700a3c301efd4e0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23522, "upload_time": "2018-03-28T20:56:51", "url": "https://files.pythonhosted.org/packages/9f/11/ba9252475298df487cd0440b4aa6a7ebcb4929f312142b0a309ad86c0bd1/datacompy-0.5.0.tar.gz" } ], "0.5.1": [ { "comment_text": "", "digests": { "md5": "06a538b61671703d72821fdb5ef07bf8", "sha256": 
"14487a00035e5e5dc710afff3cbc261d95547b790948feb84a2b6828e8515d65" }, "downloads": -1, "filename": "datacompy-0.5.1.tar.gz", "has_sig": false, "md5_digest": "06a538b61671703d72821fdb5ef07bf8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24331, "upload_time": "2018-05-19T20:53:22", "url": "https://files.pythonhosted.org/packages/f5/8b/2b6100e57c2bbe2d269cf70d1de323068165873fd907f948a4ec8e512e19/datacompy-0.5.1.tar.gz" } ], "0.5.2": [ { "comment_text": "", "digests": { "md5": "81bf6ecd81b3be589b841b646cf6efc2", "sha256": "d2df4f8cfc6311981e94683eaa724e961205aaf01c108b175afc720768e79d81" }, "downloads": -1, "filename": "datacompy-0.5.2-py3-none-any.whl", "has_sig": false, "md5_digest": "81bf6ecd81b3be589b841b646cf6efc2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 27402, "upload_time": "2019-01-23T15:07:37", "url": "https://files.pythonhosted.org/packages/93/af/196cb8e8a111ff1810e6791ba02d73d38efd0c00fede90e82e86d77d5b0c/datacompy-0.5.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "512df69b08c81621b7d233ce483a75cc", "sha256": "cd7e0c23e1064c0b24921b54db6fa610307a11a78d522cf1b9d796a99e466ab0" }, "downloads": -1, "filename": "datacompy-0.5.2.tar.gz", "has_sig": false, "md5_digest": "512df69b08c81621b7d233ce483a75cc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25321, "upload_time": "2019-01-23T15:07:38", "url": "https://files.pythonhosted.org/packages/14/cf/30b413cd9419402a0b82a2359390f1a3a5620883cc97cc63b42358a8e4d9/datacompy-0.5.2.tar.gz" } ], "0.6.0": [ { "comment_text": "", "digests": { "md5": "cbe56e66f55e460646685f24614c49b1", "sha256": "85400696811b473f99876a1baef16179608d09c7984908e7cabb480e92ac497d" }, "downloads": -1, "filename": "datacompy-0.6.0-py3-none-any.whl", "has_sig": false, "md5_digest": "cbe56e66f55e460646685f24614c49b1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 
27401, "upload_time": "2019-01-25T17:48:20", "url": "https://files.pythonhosted.org/packages/c9/e2/853349690976b55fcfcf36e00e2901f70d700abb9901a1250c603ef2238b/datacompy-0.6.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "8929481804b0bc0473ce2255d7be2604", "sha256": "ae266c96bd96e090f3daa91668094bab487fdceff97c65c8f05324c7dabc0358" }, "downloads": -1, "filename": "datacompy-0.6.0.tar.gz", "has_sig": false, "md5_digest": "8929481804b0bc0473ce2255d7be2604", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25345, "upload_time": "2019-01-25T17:48:21", "url": "https://files.pythonhosted.org/packages/3e/90/98f6911f324a3303ad8149e89b7a50c3071f52ff852b9937ecc5a962544e/datacompy-0.6.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cbe56e66f55e460646685f24614c49b1", "sha256": "85400696811b473f99876a1baef16179608d09c7984908e7cabb480e92ac497d" }, "downloads": -1, "filename": "datacompy-0.6.0-py3-none-any.whl", "has_sig": false, "md5_digest": "cbe56e66f55e460646685f24614c49b1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 27401, "upload_time": "2019-01-25T17:48:20", "url": "https://files.pythonhosted.org/packages/c9/e2/853349690976b55fcfcf36e00e2901f70d700abb9901a1250c603ef2238b/datacompy-0.6.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "8929481804b0bc0473ce2255d7be2604", "sha256": "ae266c96bd96e090f3daa91668094bab487fdceff97c65c8f05324c7dabc0358" }, "downloads": -1, "filename": "datacompy-0.6.0.tar.gz", "has_sig": false, "md5_digest": "8929481804b0bc0473ce2255d7be2604", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25345, "upload_time": "2019-01-25T17:48:21", "url": "https://files.pythonhosted.org/packages/3e/90/98f6911f324a3303ad8149e89b7a50c3071f52ff852b9937ecc5a962544e/datacompy-0.6.0.tar.gz" } ] }