{ "info": { "author": "Zach Kurtz", "author_email": "zkurtz@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "[![Build Status](https://travis-ci.org/zkurtz/shmistogram.svg?branch=master)](https://travis-ci.org/zkurtz/shmistogram)\n# shmistogram\n\nThe shmistogram is a better histogram. Key differences include\n\n- emphasizing singular modalities (i.e. point masses) with a separate multinomial distribution\n- estimating density with better accuracy and fewer bins than a histogram \nby hierarchically grouping points into variable-width bins\n\nSuppose we simulate draws from a triangular distribution (the 'crowd'), \nsupplemented with a couple of mode points ('loners'), and some null values:\n\n```python\nfrom matplotlib import pyplot as plt\nimport numpy as np\nimport shmistogram as sh\n\n# Simulate a mixture of a uniform distribution mixed with a few point masses\nnp.random.seed(0)\ncrowd = np.random.triangular(-10, -10, 70, size=500)\nloners = np.array([0]*40 + [42]*20)\nnull = np.array([np.nan]*100)\ndata = np.concatenate((crowd, loners, null))\n\nfig, axes = plt.subplots(1, 2)\n\n# Build a standard histogram with matplotlib.pyplot.hist defaults\nsh.plot.standard_histogram(data[~np.isnan(data)], ax=axes[0], name='mixed data')\n\n# Build a shmistogram\nshm = sh.Shmistogram(data)\nshm.plot(ax=axes[1], name='mixed data')\n\nfig.tight_layout()\n```\n\n![](doc/chunks/mixed.png?raw=true \"title\")\n\nThe histogram obscures the point masses somewhat and says nothing about missing values. \nBy contrast, the shmistogram uses red line segments to emphasize the point masses, and\nthe legend bar highlights the relative portions of the data in the crowd versus\nthe point masses versus the null values.\n\n## Installation\n\n- install python 3.6+\n- `pip install git+https://github.com/zkurtz/shmistogram.git#egg=shmistogram`\n- test your installation by running [demo.py](demo/demo.py)\n\n\n## Details\n\n### Default behavior\n\nGiven a 1-D array of numeric (or `np.nan`) values `data`, the shmistogram \n`shmistogram.Shmistogram(data)` \n- counts every unique value\n- splits the data into as many as 3 subsets:\n - `np.nan`\n - \"Loners\" are points with a count above the threshold set by the\n argument `loner_min_count`. Shmistogram sets this dynamically by default\n as a somewhat log-linear function of `len(data)`. With 100 points, \n the threshold is 8; with 100,000 it is 18.\n - The \"crowd\" is all remaining points.\n- bins the \"crowd\" using a density estimation tree.\n\nCalling the plot method on the resulting object displays all components\nof the distribution on a single figure.\n\n### Why shmistogram?\n\n#### Use case 1: Exploratory data analysis\n\nA shmistogram can be more informative than a histogram by separating \ncontinous and discrete variation:\n- inconsistent rounding any continuous variable can induce a mixture of point masses and relatively continuous observations\n- \"age of earning first driver's license\" plausibly has structural modes at the \nlegal minimum (which may vary by state) and otherwise vary continuously\n\n#### Use case 2: Scalable, generative density estimation\n\nThe shmistogram scales approximately as O(n log(n)) with default settings \n(see [speed_testing.ipynb](demo/speed_testing.ipynb)). \nThe resulting density model is easy to sample from, as a mixture of \na piecewise uniform \ndistribution and a multinomial distribution. Such a simple\nestimator works well as one of the required inputs of the CADE density \nestimation algorithm for high dimensional \nand mixed continuous/categorical data (see [pydens](https://github.com/zkurtz/pydens)).\n\nThe shmistogram's adaptive bin width leads to a higher-fidelity representation of \ncomplicated distributions without substantially increasing the number of bins.\nThis is not a new idea, and shmistogram wraps multiple binning\nmethods that the user can choose from. See \n[binning_methods.ipynb](demo/binning_methods.ipynb) for details.\n\n## Binning\n\nThe default binning algorithm uses a [binary density estimation tree](shmistogram/det/__init__.py) \nto iteratively split the data into smaller bins. The split location (within a bin/leaf) \nmaximizes a penalized improvement in the deviance (i.e. in-sample negative log likelihood).\nThe penalty reflects\n- a hard `min_data_in_leaf` constraint. This minimum currently defaults to 3\n- a soft penalty on bins with few observations\n\nWe choose the bin to split on as the bin for which splitting produces the greatest \npenalized improvement. Splits proceed as long as the deviance improvement exceeds \nthe number of leaves. This approach is inspired by the Akaike information criterion \n(AIC), although this may be an abuse of the criterion in the sense that we're using \nit as part of a greedy iterative procedure instead of using it to compare fully-formed models. \n\nThe variable-width binning algorithms of \n[bayesian block representations](https://arxiv.org/pdf/1207.5578.pdf) \nprovide an alternative to our default binning algorithm. See [demo](demo/bayesian_blocks.ipynb) for\nan example. See also\n[Python Perambulations](https://jakevdp.github.io/blog/2012/09/12/dynamic-programming-in-python/) \nfor a light conceptual introduction to Bayesian blocks.\n\n## Wishlist\n\n**Clarify the objective:** There is a tension between optimizing a binner for \n(a) visualization purposes, such as avoiding tall narrow bins to minimize \nwhite space, or adjusting the average bin width to tell a particular story \nand (b) minimizing a formal measure of estimation accuracy such as the \nexpectation of deviance \n(taken over future observations from the true distribution). We should\noffer guidance on which binning method tends to be most effective\nfor each of these goals.\n\n**Optimize speed** for the default method. Scalability is a big part of the\nmotivation for such a simple model, but the current implementation is\nfar from optimal.\n\n**Compare/contrast/harmonize** our binning methods with the literature:\n- [density estimation trees](https://mlpack.org/papers/det.pdf) \nsuch as [this](https://gitlab.cern.ch/landerli/density-estimation-trees)\n- [distribution element trees](https://arxiv.org/pdf/1610.00345.pdf) such as \n[detpack](https://github.com/cran/detpack/blob/master/R/det1.R). See\n[detpack_example.R](demo/detpack_example.R) for a simple variable-width binner.\n- [Efficient Density Estimation via Piecewise Polynomial \nApproximation](https://arxiv.org/pdf/1305.3207.pdf).\n\n\n## Disclaimer\n\nThis repo is young, has practically no unit tests, and should be expected to change substantially. Use with caution.\n\n## License\n\nThis project is licensed under the terms of the MIT license. See LICENSE for additional details.\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/zkurtz/shmistogram", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "shmistogram", "package_url": "https://pypi.org/project/shmistogram/", "platform": "", "project_url": "https://pypi.org/project/shmistogram/", "project_urls": { "Homepage": "https://github.com/zkurtz/shmistogram" }, "release_url": "https://pypi.org/project/shmistogram/0.2.4/", "requires_dist": [ "astropy", "matplotlib", "pandas", "numpy", "pytz", "scipy", "six" ], "requires_python": "", "summary": "Piecewise-uniform univariate density estimation and visualization", "version": "0.2.4" }, "last_serial": 5138604, "releases": { "0.2.4": [ { "comment_text": "", "digests": { "md5": "2a5ca51a2dd7db7afa3a571b394cbe11", "sha256": "11ebb198e81ec45a92f6f51499fff27423e270046e26a86790a94491e6d10211" }, "downloads": -1, "filename": "shmistogram-0.2.4-py3-none-any.whl", "has_sig": false, "md5_digest": "2a5ca51a2dd7db7afa3a571b394cbe11", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 18714, "upload_time": "2019-04-13T17:01:58", "url": "https://files.pythonhosted.org/packages/d4/43/16c54ec4004ad6e7571d2a639951b7ae39731beba70c679f7e4a464a813a/shmistogram-0.2.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1d5daa1826e2eda058f1e7a7988b8f91", "sha256": "babb55d8d93ef66d72cd03b310574a7604d21b0a08e60886fe73cc01f56fc495" }, "downloads": -1, "filename": "shmistogram-0.2.4.tar.gz", "has_sig": false, "md5_digest": "1d5daa1826e2eda058f1e7a7988b8f91", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17089, "upload_time": "2019-04-13T17:02:01", "url": "https://files.pythonhosted.org/packages/29/62/e3ce8c2850f2e9a8e4569dae6d31aa30fb9e8f366d9a2b43a2ed4d34cfa8/shmistogram-0.2.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2a5ca51a2dd7db7afa3a571b394cbe11", "sha256": "11ebb198e81ec45a92f6f51499fff27423e270046e26a86790a94491e6d10211" }, "downloads": -1, "filename": "shmistogram-0.2.4-py3-none-any.whl", "has_sig": false, "md5_digest": "2a5ca51a2dd7db7afa3a571b394cbe11", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 18714, "upload_time": "2019-04-13T17:01:58", "url": "https://files.pythonhosted.org/packages/d4/43/16c54ec4004ad6e7571d2a639951b7ae39731beba70c679f7e4a464a813a/shmistogram-0.2.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1d5daa1826e2eda058f1e7a7988b8f91", "sha256": "babb55d8d93ef66d72cd03b310574a7604d21b0a08e60886fe73cc01f56fc495" }, "downloads": -1, "filename": "shmistogram-0.2.4.tar.gz", "has_sig": false, "md5_digest": "1d5daa1826e2eda058f1e7a7988b8f91", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17089, "upload_time": "2019-04-13T17:02:01", "url": "https://files.pythonhosted.org/packages/29/62/e3ce8c2850f2e9a8e4569dae6d31aa30fb9e8f366d9a2b43a2ed4d34cfa8/shmistogram-0.2.4.tar.gz" } ] }