{ "info": { "author": "Marek Gagolewski", "author_email": "marek@gagolewski.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering" ], "description": "Genieclust Python Package (**under development**)\n=========================\n\nThe *Genie*+ Clustering Algorithm\n----------------------------------\n\nAuthor: [Marek Gagolewski](http://www.gagolewski.com)\n\nThe time needed to apply a hierarchical clustering algorithm is most\noften dominated by the number of computations of a pairwise dissimilarity\nmeasure. Such a constraint, for larger data sets, puts at a disadvantage\nthe use of all the classical linkage criteria but the single linkage one.\nHowever, it is known that the single linkage clustering algorithm is very\nsensitive to outliers, produces highly skewed dendrograms, and therefore\nusually does not reflect the true underlying data structure -\nunless the clusters are well-separated.\n\nTo overcome its limitations, we proposed a new hierarchical clustering linkage\ncriterion called Genie. Namely, our algorithm links two clusters in such\na way that a chosen economic inequity measure (here, the Gini index)\nof the cluster sizes does not increase drastically above a given threshold.\n\nBenchmarks indicate a high practical usefulness of the introduced method:\nit most often outperforms the Ward or average linkage, k-means,\nspectral clustering, DBSCAN, Birch, and others in terms of the clustering\nquality while retaining the single linkage speed. The algorithm is easily\nparallelizable and thus may be run on multiple threads to speed up its\nexecution further on. Its memory overhead is small: there is no need\nto precompute the complete distance matrix to perform the computations\nin order to obtain a desired clustering.\n\nSee: Gagolewski M., Bartoszuk M., Cena A.,\nGenie: A new, fast, and outlier-resistant hierarchical clustering algorithm,\n*Information Sciences* **363**, 2016, pp. 8-23.\ndoi:[10.1016/j.ins.2016.05.003](http://dx.doi.org/10.1016/j.ins.2016.05.003)\n\n\n\nThis is a new, faster and even more robust implementation\nof the original algorithm available on CRAN,\nsee R package [`genie`](http://www.gagolewski.com/software/genie/).\n\n\n\nPackage Features\n================\n\n* The Genie+ algorithm (using a `scikit-learn`-like interface),\ntogether with a robustified version of HDBSCAN*\n* DisjointSets data structure (with extensions)\n* Various inequity measures (the Gini index, the Bonferroni index, etc.)\n* Functions to compute partition similarity measures\n(the Rand, adjusted Rand, Fowlkes-Mallows, and adjusted Fowlkes-Mallows index)\n* An implementation of the Prim algorithm to compute the minimum spanning tree\n(@TODO@ parallelized, requiring O(n**2) time and O(n) memory)\n* Various plotting functions\n\n\nInstallation\n============\n\nThe package requires Python 3.6+ together with\n`sklearn`, `numpy`, `scipy`, `matplotlib`, and `cython`.\n\n\nVia `pip`:\n\n```python\n# @TODO@ - not yet on PyPI\n```\n\nThe most recent development version:\n\n```bash\ngit clone https://github.com/gagolews/genieclust.git\ncd genieclust\npython setup.py build_ext --inplace\npython setup.py install --user\n```\n\nExamples\n========\n\n* [The Genie Algorithm - basic use](https://github.com/gagolews/genieclust/blob/master/example_genie_basic.ipynb)\n* [The Genie Algorithm with Noise Points Detection](https://github.com/gagolews/genieclust/blob/master/example_genie_hdbscan.ipynb)\n\n\nLicense\n=======\n\nThis package is licensed under the BSD 3-Clause \"New\" or \"Revised\" License.\n\n```\nCopyright (C) 2018 Marek.Gagolewski.com\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without modification,\nare permitted provided that the following conditions are met:\n\n1. Redistributions of source code must retain the above copyright notice,\nthis list of conditions and the following disclaimer.\n\n2. Redistributions in binary form must reproduce the above copyright notice,\nthis list of conditions and the following disclaimer in the documentation\nand/or other materials provided with the distribution.\n\n3. Neither the name of the copyright holder nor the names of its contributors\nmay be used to endorse or promote products derived from this software without\nspecific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,\nTHE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE\nARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\nFOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\nSERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\nCAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\nOR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\nOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/gagolews/genieclust", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://www.gagolewski.com/software/", "keywords": "", "license": "BSD-3-Clause", "maintainer": "Marek Gagolewski", "maintainer_email": "marek@gagolewski.com", "name": "genieclust", "package_url": "https://pypi.org/project/genieclust/", "platform": "", "project_url": "https://pypi.org/project/genieclust/", "project_urls": { "Download": "https://github.com/gagolews/genieclust", "Homepage": "http://www.gagolewski.com/software/" }, "release_url": "https://pypi.org/project/genieclust/0.1a2/", "requires_dist": null, "requires_python": "", "summary": "The Genie+ Clustering Algorithm", "version": "0.1a2" }, "last_serial": 3891517, "releases": { "0.1a2": [ { "comment_text": "", "digests": { "md5": "f776c26e0e9bb1a407dd35caa4af0228", "sha256": "48f340767864e78fc19548207ab8d4689f53ff7140fb996c744bfcb66eb7c4e9" }, "downloads": -1, "filename": "genieclust-0.1a2.tar.gz", "has_sig": false, "md5_digest": "f776c26e0e9bb1a407dd35caa4af0228", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18821, "upload_time": "2018-05-23T13:48:04", "url": "https://files.pythonhosted.org/packages/d7/10/04b5d233ec56a5db528b04e6e350c846d1d48d6b31fb5d9daa85b5c79bb0/genieclust-0.1a2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f776c26e0e9bb1a407dd35caa4af0228", "sha256": "48f340767864e78fc19548207ab8d4689f53ff7140fb996c744bfcb66eb7c4e9" }, "downloads": -1, "filename": "genieclust-0.1a2.tar.gz", "has_sig": false, "md5_digest": "f776c26e0e9bb1a407dd35caa4af0228", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18821, "upload_time": "2018-05-23T13:48:04", "url": "https://files.pythonhosted.org/packages/d7/10/04b5d233ec56a5db528b04e6e350c846d1d48d6b31fb5d9daa85b5c79bb0/genieclust-0.1a2.tar.gz" } ] }