{ "info": { "author": "Tomas Gavenciak", "author_email": "gavento@ucw.cz", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": "Single-Linkage connectivity clustering\n======================================\n\nA fast algorithm computing connectivity components (clusters) of elements based\non their feature cosine similarity. Two elements are connected by an edge if\ntheir cosine similarity is above a given threshold. Accepts either a numpy\n2D array or a scipy compressed sparse row (CSR) matrix.\n\nInstall from PyPI\nwith ``pip install silicon-clustering``\n\n Tom\u00e1\u0161 Gaven\u010diak ``gavento@ucw.cz``\n\nUsage example\n-------------\n\n::\n\n import silicon, numpy\n # 1000 rows, 10 features\n data = numpy.random.rand(1000, 10)\n # The ensemble normalizes rows of the data by default\n # Choose high verbosity (via logging), cosine similarity >=0.97\n ens = silicon.CosineClustering(data, sim_threshold=0.97, verbosity=2)\n ens.run()\n # (... progress reports)\n ens.clusters[0]\n # \n ens.clusters_by_size()[0]\n # \n # With pyplot you can see the projected points\n import matplotlib.pyplot as plt\n ens.plot(); plt.show()\n # or the individual clusters\n ens.clusters_by_size()[0].plot(ens); plt.show()\n\nDetails\n-------\n\nThe algorithm uses several tricks to speed up the computation compared to traditional\nall-pair scalar products via matrix multiplication:\n\n* The data is projected into ``cell_dims`` principal components (PCA) by feature. The\n projection divides the data into cells of size ``self.distance`` so only rows from\n adjacent cells have a chance to be similar enough. 
Then the vectors of the rows of\n the adjacent cells are multiplied.\n\n* This would still mean that some cells are too big for matrix multiplication, so a second\n trick is used before the cell multiplication: nibbling at approximately 1000 random points of the\n dataset. For a random center row, the similarities to all other rows are computed and\n all similar points are clustered together (possibly merging existing clusters). This has\n the effect of pre-clustering the densest points of the dataset (esp. repeated values)\n - dense clusters have a good chance to be hit with a center, eliminating most of the\n mass of the cluster (as well as the respective cells). The number of nibble\n iterations should be tuned according to the data (to reasonably decrease the cell size).\n\n* To combine the two tricks, a portion of the clustered points (e.g. 10%) together with\n all the unclustered points are considered for adjacent cell multiplication. The 10%\n returned points should ensure that the nibbled clusters are linked with any points not\n hit by the nibble but close to and in effect belonging to the clusters.\n\nSince not all nibbled rows are used in the adjacent cell scalar product, the algorithm\nmay miss a few individual cluster connections at nibble ball boundaries, but we found it\nunlikely in practice.\n\nThe algorithm clusters 10000 Gaussian points in 30 dimensions in under a second.\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/CZ-NIC/silicon-clustering", "keywords": "connectivity clustering similarity PCA", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "silicon-clustering", "package_url": "https://pypi.org/project/silicon-clustering/", "platform": "", "project_url": "https://pypi.org/project/silicon-clustering/", "project_urls": { "Homepage": "https://github.com/CZ-NIC/silicon-clustering" }, "release_url": 
"https://pypi.org/project/silicon-clustering/0.3.1/", "requires_dist": [ "numpy (>=1.10)", "scikit-learn (>=0.18)", "scipy (>=0.19)", "six (>=1.10)" ], "requires_python": "", "summary": "Single-linkage connectivity clustering by cosine similarity", "version": "0.3.1" }, "last_serial": 2755952, "releases": { "0.3.1": [ { "comment_text": "", "digests": { "md5": "7b93e3c87abbdc6c09d7f6e9c200d150", "sha256": "264ceb7e3df1f39813e25a65276a624ec5540df1136f030e4fe7715ccca47f64" }, "downloads": -1, "filename": "silicon_clustering-0.3.1-py2-none-any.whl", "has_sig": false, "md5_digest": "7b93e3c87abbdc6c09d7f6e9c200d150", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 16617, "upload_time": "2017-04-05T20:10:28", "url": "https://files.pythonhosted.org/packages/19/5c/f323c6338361548ba082f01369b95b00ac46346eef1103aeb1bcf2094e4c/silicon_clustering-0.3.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb491ac399a008fad835cd9dbf198c93", "sha256": "0e221d3e68c0a9b62ed42039dfadd54c610f962b313fc15952b76b6a68e4b446" }, "downloads": -1, "filename": "silicon_clustering-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "eb491ac399a008fad835cd9dbf198c93", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16618, "upload_time": "2017-04-05T20:10:31", "url": "https://files.pythonhosted.org/packages/32/7a/916fe5913e8d9dd02b5d2061e980bfc0ee7305949deaf9d4c94f570d3f17/silicon_clustering-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "52652a872477815fac0dbcd74c3b934d", "sha256": "28e94dfbd8d0806436b1222ec8658622d21e08354e957e1b52b8f0cc957dc20a" }, "downloads": -1, "filename": "silicon-clustering-0.3.1.tar.gz", "has_sig": false, "md5_digest": "52652a872477815fac0dbcd74c3b934d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11860, "upload_time": "2017-04-05T20:10:32", "url": 
"https://files.pythonhosted.org/packages/33/07/25632217badc10da1bcbff428722b5c6b36dfcfccd4720e570bca0d96fba/silicon-clustering-0.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7b93e3c87abbdc6c09d7f6e9c200d150", "sha256": "264ceb7e3df1f39813e25a65276a624ec5540df1136f030e4fe7715ccca47f64" }, "downloads": -1, "filename": "silicon_clustering-0.3.1-py2-none-any.whl", "has_sig": false, "md5_digest": "7b93e3c87abbdc6c09d7f6e9c200d150", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 16617, "upload_time": "2017-04-05T20:10:28", "url": "https://files.pythonhosted.org/packages/19/5c/f323c6338361548ba082f01369b95b00ac46346eef1103aeb1bcf2094e4c/silicon_clustering-0.3.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb491ac399a008fad835cd9dbf198c93", "sha256": "0e221d3e68c0a9b62ed42039dfadd54c610f962b313fc15952b76b6a68e4b446" }, "downloads": -1, "filename": "silicon_clustering-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "eb491ac399a008fad835cd9dbf198c93", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16618, "upload_time": "2017-04-05T20:10:31", "url": "https://files.pythonhosted.org/packages/32/7a/916fe5913e8d9dd02b5d2061e980bfc0ee7305949deaf9d4c94f570d3f17/silicon_clustering-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "52652a872477815fac0dbcd74c3b934d", "sha256": "28e94dfbd8d0806436b1222ec8658622d21e08354e957e1b52b8f0cc957dc20a" }, "downloads": -1, "filename": "silicon-clustering-0.3.1.tar.gz", "has_sig": false, "md5_digest": "52652a872477815fac0dbcd74c3b934d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11860, "upload_time": "2017-04-05T20:10:32", "url": "https://files.pythonhosted.org/packages/33/07/25632217badc10da1bcbff428722b5c6b36dfcfccd4720e570bca0d96fba/silicon-clustering-0.3.1.tar.gz" } ] }