{ "info": { "author": "Fei Zhan", "author_email": "enfeizhan@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Programming Language :: Python" ], "description": "=========\nClusteror\n=========\n\n* Unveils internal \"invisible\" patterns automatically\n* https://github.com/enfeizhan/clusteror\n* Fei Zhan\n* License: MIT License\n\nVersion support\n===============\nThis package is developed in Python 3.4. However support for Python 2 will be\nin the next release.\n\nDescription\n===========\n\nThis is a tool for discovering patterns existing in a dataset. It can be useful\nin segmenting customers based on their demographic, geographic, and past\nbehaviours in a commercial marketing environment. Users are encouraged to\nadventure with it in other scenarios.\n\nUnder the hood, a pretraining of \n`(Stacked) Denoising Autoencoder `__\nis implemented in\n`Python deep learning `__ library\n`Theano `__. Therefore, an installation\nof Theano is mandate. About how to install Theano, please read\n`Theano installation guide `__.\n\nIf you are lucky to have Nvidia graphic card, with proper setup Theano can\nparallel the calculation to be able to scale up to large datasets.\n\nWhy we need it?\n===============\n\nThe purpose of Unsupervised Machine Learning is to find underlying patterns\nwithout being provided with known categories as Supervised Classification\nMachine Learning. While it sounds like the most leveraged use case of Machine\nLearning as it does not rely on expensive labels that generally need to be\ncreated by human, its efficiency is debatable for commonly seen algorithms.\n\nRecall the very first clustering algorithm in machine learning lectures,\nK-Means. It looks amazing for beginners. The randomly initiated centroids\nseem to be intelligent to know where they should gravitate to. At least for\nthe example datasets for illustrating purposes.\n\nBut real life problems aren't that straightforward. It's not unlikely you are\ngiven a dataset that looks like this in a 2D space:\n\n.. image:: tests/readme_pics/original_problem.png\n\nUnfortunately this is the best K-Means can do and you probably scratch you head\nwondering what you can do to solve this seemingly straightforward problem.\n\n.. image:: tests/readme_pics/bad_kmeans.png\n\nThe cost function of K-Means instructs centroids to search for points\nlocated in a spharical region centring around them. This assumption of what\na cluster is no doubt fails when a cluster of points resides in a stripe\nshape like in the example above.\n\nHow can we do better?\n=====================\nThis \"unexpected\" (actually this is a well-known fact) failure stems from that\nK-Means fumbles in higher dimensional space. While this terminates K-Means to\nbe an awesome clustering tool that is deployable in extensive environments,\nits performance in one dimensional space has been ignored largely due to\nexamples in two dimension are more illustrative.\n\nHaving this in mind, we would work on to search ways of reducing high\ndimensional dataset to one dimension. As this is truly the cause responsible\nfor all strugglings in clustering high dimensional data.\n\nFortunately, the progress in Deep Learning sheds new light on this issue. As\nlabeled dataset is expensive, experts in this area came up with the salient\nidea of Pre-training, which enormously reduced the amount of labeled dataset\nneeded to train a deep learning neural network.\n\nIn a nutshell, pre-training updates the parameters in neural network such that\nan input can traverse the network forward and back therefore reproduce itself.\nA cost function is defined regarding the difference between the input and\nreproduced input instead of input and output.\n\nWhile it's highly recommended to read research articles to gain more detail,\na characteristic property of this process is tremendously useful for reducing\ndataset in high dimension down to lower dimension.\n\nThe central idea is given the input can be loyally or nearly loyally\nreproduced the neural network can be expected to do two things:\n\n* one-to-one mapping input to output; A different input results in a different\n output.\n* Similar inputs lead to close outputs.\n\nThese two characteristics ensure clustering done in low-dimensional space\nis equivalent to do the clustering in high dimension.\n\nA picture is worth a thousand words. The chart below is the result of a three\nlayer neural network:\n\n.. image:: tests/readme_pics/leveraged_kmeans.png\n\nThe magic lies in how it looks in the mapped one dimensional space:\n\n.. image:: tests/readme_pics/hist.png\n\nWithout too much explanation, the left blue bars are the from the right blue\npoints and the right red bars from the left red points. As there isn't\nconcept for a sphere in 1d, K-Means works perfectly.\n\nWhat's in the package and what's coming up?\n===========================================\nAt the moment, you can do the following with the tools provided by the\npackage:\n\n* Dimension reduction to 1d.\n* Clustering in 1d and assign cluster ids back to the original dataset.\n* Certain useful plotting tools to display the efficiency of the clustering.\n\nThis project is still in alpha stage. In the imminent release, you can expect\n\n* Tools to searching for the drivers that distinct the clusters.\n* More support in Python versions.\n\nNote\n====\n\nThis project has been set up using PyScaffold 2.5.7. For details and usage\ninformation on PyScaffold see http://pyscaffold.readthedocs.org/.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://...", "keywords": "", "license": "none", "maintainer": "", "maintainer_email": "", "name": "clusteror", "package_url": "https://pypi.org/project/clusteror/", "platform": "any", "project_url": "https://pypi.org/project/clusteror/", "project_urls": { "Homepage": "http://..." }, "release_url": "https://pypi.org/project/clusteror/0.0.2/", "requires_dist": null, "requires_python": "", "summary": "Dataset pattern discovery tool", "version": "0.0.2" }, "last_serial": 2515815, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "420a9f869d3e408011ae244f93854fa2", "sha256": "650c01f279a72279ae3c9b1863ccc94940bde5854d3813c2450770930874d3a6" }, "downloads": -1, "filename": "clusteror-0.0.1.tar.gz", "has_sig": false, "md5_digest": "420a9f869d3e408011ae244f93854fa2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2259721, "upload_time": "2016-12-08T01:23:37", "url": "https://files.pythonhosted.org/packages/67/23/62349c9fc766ead2928bb565b8cf3385a34a39e0e63d006651678d3d6c01/clusteror-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "d677de10b05f5cf55f00b088e11ab48c", "sha256": "7b1686679eb142c0a2fcdd5bff0b5318f0f1bd2a82e12eeb73231eb87f42afae" }, "downloads": -1, "filename": "clusteror-0.0.2.tar.gz", "has_sig": false, "md5_digest": "d677de10b05f5cf55f00b088e11ab48c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2258519, "upload_time": "2016-12-13T06:34:29", "url": "https://files.pythonhosted.org/packages/78/21/b8f05ca9015962b2d53bb1ad099dd7e9b05c097fdd3be95943261efa25d6/clusteror-0.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d677de10b05f5cf55f00b088e11ab48c", "sha256": "7b1686679eb142c0a2fcdd5bff0b5318f0f1bd2a82e12eeb73231eb87f42afae" }, "downloads": -1, "filename": "clusteror-0.0.2.tar.gz", "has_sig": false, "md5_digest": "d677de10b05f5cf55f00b088e11ab48c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2258519, "upload_time": "2016-12-13T06:34:29", "url": "https://files.pythonhosted.org/packages/78/21/b8f05ca9015962b2d53bb1ad099dd7e9b05c097fdd3be95943261efa25d6/clusteror-0.0.2.tar.gz" } ] }