{ "info": { "author": "Christopher Baker", "author_email": "chriscrewbaker@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# validclust\n\n> Validate clustering results\n\n[![Linux Build Status](https://travis-ci.org/crew102/validclust.svg?branch=master)](https://travis-ci.org/crew102/validclust) \n[![PyPI version](https://img.shields.io/pypi/v/validclust.svg)](https://pypi.org/project/validclust/)\n\n## Motivation\n\nClustering algorithms often require that the analyst specify the number of clusters that exist in the data, a parameter commonly known as `k`. One approach to determining an appropriate value for `k` is to cluster the data using a range of values for `k`, then evaluate the quality of the resulting clusterings using a cluster validity index (CVI). The value of `k` that results in the best partitioning of the data according to the CVI is then chosen. `validclust` handles this process for the analyst, making it easy to determine an optimal value for `k`. \n\n## Installation\n\nYou can get the stable version from PyPI:\n\n```\npip install validclust\n```\n\nOr the development version from GitHub:\n\n```\npip install git+https://github.com/crew102/validclust.git\n```\n\n## Basic usage\n\n1. Load libraries.\n\n```python\nimport matplotlib.pyplot as plt\nfrom sklearn.datasets import make_blobs\nfrom validclust.validclust import ValidClust\n```\n\n2. Create some synthetic data. The data will be clustered around 4 centers.\n\n```python\ndata, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=0)\n```\n\n3. Use `ValidClust` to determine the optimal number of clusters. 
The code below will partition the data into 2-7 clusters using two different clustering algorithms, then calculate various CVIs across the results.\n\n```python\nvclust = ValidClust(\n    k=list(range(2, 8)),\n    methods=['hierarchical', 'kmeans']\n)\ncvi_vals = vclust.fit_predict(data)\nprint(cvi_vals)\n#>                                    2            3            4            5  \\\n#> method       index\n#> hierarchical silhouette     0.645563     0.633970     0.747064     0.583724\n#>              calinski    1007.397799  1399.552836  3611.526187  2832.925655\n#>              davies         0.446861     0.567859     0.361996     1.025296\n#>              dunn           0.727255     0.475745     0.711415     0.109312\n#> kmeans       silhouette     0.645563     0.633970     0.747064     0.602562\n#>              calinski    1007.397799  1399.552836  3611.526187  2845.143428\n#>              davies         0.446861     0.567859     0.361996     0.988223\n#>              dunn           0.727255     0.475745     0.711415     0.115113\n#>\n#>                                    6            7\n#> method       index\n#> hierarchical silhouette     0.435456     0.289567\n#>              calinski    2371.222506  2055.323553\n#>              davies         1.509404     1.902413\n#>              dunn           0.109312     0.116557\n#> kmeans       silhouette     0.468945     0.334379\n#>              calinski    2389.531071  2096.945591\n#>              davies         1.431102     1.722117\n#>              dunn           0.098636     0.072423\n```\n\nIt's hard to see what the optimal value of `k` is from the raw CVI values shown above. Not all of the CVIs are on a 0-1 scale, and lower scores are actually associated with better clusterings for some of the indices. `ValidClust`'s `plot()` method solves this problem by first normalizing the CVIs and then displaying the results in a heatmap.\n\n```python\nvclust.plot()\n```\n\n![](https://i.imgur.com/lh4lROu.png)\n\nFor each row in the above grid (i.e., for each clustering method/CVI pair), darker cells are associated with higher-quality clusterings. 
From this plot we can see that each method/index pair seems to be pointing to 4 as being an optimal value for `k`.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://validclust.readthedocs.io", "keywords": "", "license": "LICENSE.txt", "maintainer": "", "maintainer_email": "", "name": "validclust", "package_url": "https://pypi.org/project/validclust/", "platform": "", "project_url": "https://pypi.org/project/validclust/", "project_urls": { "Homepage": "https://validclust.readthedocs.io" }, "release_url": "https://pypi.org/project/validclust/0.1.0/", "requires_dist": [ "scikit-learn", "pandas", "numpy", "seaborn", "matplotlib" ], "requires_python": "", "summary": "Validate clustering results", "version": "0.1.0" }, "last_serial": 4769358, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "4c1f954ae308be5db6c438461a11c4b0", "sha256": "1ed4efd7662d8e57cfc20c4d3606e2250d7020cadd6b12b4269deacd9d7637a9" }, "downloads": -1, "filename": "validclust-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4c1f954ae308be5db6c438461a11c4b0", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 7009, "upload_time": "2019-02-01T18:03:09", "url": "https://files.pythonhosted.org/packages/89/01/e4e148d4631bfffba1b20c00130fa959376709e378834c29f50f6ae2a62a/validclust-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb5dd95564ca60853095c2daa5295566", "sha256": "2d01d75076ba6cfeb5cec42fd1b3a41af262f4fc21c7df887fcd3f0c6e6dd46f" }, "downloads": -1, "filename": "validclust-0.1.0.tar.gz", "has_sig": false, "md5_digest": "eb5dd95564ca60853095c2daa5295566", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7455, "upload_time": "2019-02-01T18:03:11", "url": 
"https://files.pythonhosted.org/packages/ff/34/c91d6351d7cd43a39a75ee086e0c79f15a6693d1b685d52a612d86d2f9eb/validclust-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4c1f954ae308be5db6c438461a11c4b0", "sha256": "1ed4efd7662d8e57cfc20c4d3606e2250d7020cadd6b12b4269deacd9d7637a9" }, "downloads": -1, "filename": "validclust-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4c1f954ae308be5db6c438461a11c4b0", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 7009, "upload_time": "2019-02-01T18:03:09", "url": "https://files.pythonhosted.org/packages/89/01/e4e148d4631bfffba1b20c00130fa959376709e378834c29f50f6ae2a62a/validclust-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb5dd95564ca60853095c2daa5295566", "sha256": "2d01d75076ba6cfeb5cec42fd1b3a41af262f4fc21c7df887fcd3f0c6e6dd46f" }, "downloads": -1, "filename": "validclust-0.1.0.tar.gz", "has_sig": false, "md5_digest": "eb5dd95564ca60853095c2daa5295566", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7455, "upload_time": "2019-02-01T18:03:11", "url": "https://files.pythonhosted.org/packages/ff/34/c91d6351d7cd43a39a75ee086e0c79f15a6693d1b685d52a612d86d2f9eb/validclust-0.1.0.tar.gz" } ] }