{ "info": { "author": "Martin Grund", "author_email": "grundprinzip+pip@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "pyxplorer -- Easy Interactive Data Profiling for Big Data (and Small Data)\n--------------------------------------------------------------------------\n\nThe goal of pyxplorer is to provide a simple tool that allows interactive\nprofiling of datasets that are accessible via a SQL-like interface. The only\nrequirement to run data profiling is that you are able to provide a Python\nDBAPI-like interface to your data source and the data source is able to\nunderstand simplistic SQL queries.\n\nI built this piece of software while trying to get a better understanding of\ndata distribution in a massive dataset of several hundred million records.\nDepending on the size of the dataset and the query engine, the response time\ncan range from seconds (Impala) to minutes (Hive) or even hours (MySQL).\n\nThe typical use case is to use ```pyxplorer``` interactively from an iPython\nNotebook or iPython shell to incrementally extract information about your data.\n\nUsage\n------\n\nImagine that you are provided with access to a huge Hive/Impala database on\nyour very own Hadoop cluster and you're asked to profile the data to get a\nbetter understanding for performing more specific data science later on::\n\n import pyxplorer as pxp\n from impala.dbapi import connect\n conn = connect(host='impala_server', port=21050)\n\n db = pxp.Database(\"default\", conn)\n db.tables()\n\nThis simple code gives you access to all the tables in this database. 
So let's\nassume the result shows a ```sales_orders``` table, what can we do now?::\n\n orders = db[\"sales_orders\"]\n orders.size() # 100M\n orders.columns() # [ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ...]\n\nOk, if we have so many columns, what can we find out about a single column?::\n\n orders.ol_d_id.min() # 1\n orders.ol_d_id.max() # 9999\n orders.ol_d_id.dcount() # 1000\n\nIn the same way, further key figures about the data are available, such as\nuniqueness, constancy, most and least frequent values, and distribution.\n\nIn some cases, where it makes sense, the output of a method call will not be a\nsimple array or list but directly a Pandas dataframe to facilitate plotting\nand further analysis.\n\nYou will find an easier-to-digest tutorial here:\n\n * http://nbviewer.ipython.org/github/grundprinzip/pyxplorer/blob/master/pyxplorer_stuff.ipynb\n\n\nSupported Features\n-------------------\n\n * Column Count (Database / Table)\n * Table Count\n * Tuple Count (Database / Table)\n * Min / Max\n * Most Frequent / Least Frequent\n * Top-K Most Frequent / Top-K Least Frequent\n * Top-K Value Distribution (Database / Table)\n * Uniqueness\n * Constancy\n * Distinct Value Count\n\n\nSupported Platforms\n--------------------\n\nThe following platforms are typically tested while using ```pyxplorer```:\n\n * Hive\n * Impala\n * MySQL\n\n\nDependencies\n-------------\n\n * pandas\n * pyhs2 for Hive-based loading of data sets\n * pympala for connecting to Impala\n * snakebite for loading data from HDFS to Hive", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/grundprinzip/pyxplorer", "keywords": null, "license": "LICENSE", "maintainer": null, "maintainer_email": null, "name": "pyxplorer", "package_url": "https://pypi.org/project/pyxplorer/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/pyxplorer/", "project_urls": { 
"Download": "UNKNOWN", "Homepage": "http://github.com/grundprinzip/pyxplorer" }, "release_url": "https://pypi.org/project/pyxplorer/0.1.0/", "requires_dist": null, "requires_python": null, "summary": "Simple Big Data Profiling", "version": "0.1.0" }, "last_serial": 1152131, "releases": { "0.1.0": [] }, "urls": [] }