{ "info": { "author": "Bernd Fritzke", "author_email": "fritzke@web.de", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: BSD License", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": "\n# kmeanstf\n## k-means++ implementation based on tensorflow-gpu\n![alt text][logo]\n\n[logo]: https://raw.githubusercontent.com/gittar/kmeanstf/master/img/million_100.png \"k-means++ example\"\n# Quick Start\n\nTo use k-means++ with GPU-support do the following:\n\n```pip install kmeanstf```\n\n(requires tensorflow-gpu, at least version 1.14 or 2.0b)\n\nExecute the following [test program](https://github.com/gittar/kmeanstf/blob/master/demo/test.py) to produce the above graphic.\n\n```python\nimport tensorflow as tf\nimport matplotlib.pyplot as plt\nfrom kmeanstf import KMeansTF\n\n# create data set\nX = tf.random.normal([50000,2])\n# create kmeanstf object\nkm = KMeansTF(n_clusters=100, n_init=1)\n# adapt\nkm.fit(X)\n\n# plot result (optional)\nm=10000 # max number of data points to display\nfig,ax = plt.subplots(figsize=(8,8))\nax.scatter(X[:m,0],X[:m,1],s=1)\nax.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],s=8,c=\"r\")\nplt.show()\n```\n\n\n\n\n# What is k-means?\n[k-means](https://en.wikipedia.org/wiki/K-means_clustering) - a.k.a. Lloyds Algorithm - is perhaps the most well-known clustering method. It positions k centroids (means) over a given data set in order to represent/cluster the data set.\n\nStarting from some initial positions of the centroids a number of so-called \"Lloyd iterations\" is performed until a stopping criterion is met. One Lloyd iteration consists of the following two steps:\n* determine for each centroid the subset of data points for which this centroid is closest\n* move each centroid to the mean of its associated data sub-set (hence the name \"k-means\")\n\nIt can be shown that each Lloyd iteration decreases the sum of squared Euclidean distances of all data points to their respective nearest centroid or leaves it unchanged in which case the algorithm has converged.\n\nThe final sum of distances depends strongly on the initial centroid positions. A common goal is to find a set of centroids with a small sum of distances (see example below). \n\n\n# k-means++\nIn 2006 a particular initialization method was proposed which produces provably good (but generally still sub-optimal) results: [k-means++](http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf). One can describe the core of k-means++ as the sequential positioning of centroids in areas where the given data points are far from the already positioned centroids. k-means++ is said to both generate better solutions than k-means with random initialization as well as needing fewer Lloyd iterations.\n\nProbably due to its good results k-means++ has been chosen as the default initialization method for the implementation of k-means in the popular [scikit-learn python library](https://scikit-learn.org) for machine learning\n\n# why kmeanstf?\n\nk-means for large data sets and large number of centroids is computationally expensive (slow!). For portability reasons scikit-learn does not support the use of GPUs (https://scikit-learn.org/stable/faq.html#will-you-add-gpu-support). Therefore we provide here an implementation of k-means (and the k-means++ initialization) based on tensorflow and making use of a GPU if available. The k-means++ initialization is a port from the [scikit-learn implementation of k-means++](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py) to tensorflow (using tensorflow matrix operations instead of the numpy ones used by scikit-learn). The tensorflow-based implementation of Lloyd iterations borrows some aspects from \nShawn Simister's gist https://gist.github.com/narphorium/d06b7ed234287e319f18 and added the splitting of large data sets into subsets of a given size as well as finding a suitable size automatically for a given GPU if memory errors occur with the default size.\n\n**Please note:** Only a subset of the scikit-learn KMeans class is implemented in KMeansTF, in particular the fit(), predict() and fit_predict() methods. There is e.g. no handling of sparse matrices, no mini-batch version, no sample weights. For these features please use scikit-learn. Using kmeanstf makes most sense if you like to perform standard k-means (k-means++) on a GPU for large data sets.\n# Speed-up\n\nOn a linux machine (Ubuntu 18.04 LTS) equipped with a NVidia GTX-1060 6MB graphics card we observed a speed-up of 7-10 for larger data sets (n > 200k points). Example: for a data set of n = 10\u2076 (1 million) 2-dimensional vectors and k=100 centroids the execution time for 30 Lloyd iterations was 6.34 seconds a.o.t. 62.17 seconds for scikit-learn (this is a speed-up of 9.81)\n\n![alt text][barspeed]\n\n[barspeed]: https://raw.githubusercontent.com/gittar/kmeanstf/master/img/barspeed3.png \"execution times\"\n\nBelow you see speed-up values measured for different values of data set size *n* and number of centroids *k* for 2D-data. For larger data sets also the speed-up tends to be higher. For small data sets, however, KMeansTF is actually often slower than scikit-learn KMeans. Perhaps this is caused caused by a start-up time for tensorflow. One could argue that this is not so relevant but it may also indicate potential for improvement for the kmeanstf implementation.\n![alt text][speed-up]\n\n\n[speed-up]: https://raw.githubusercontent.com/gittar/kmeanstf/master/img/speedupa2.png \"speed-up factors\"\n\n# Why is it so fast?\n\nThree lines of code which seem to originate from Shawn Simister (https://twitter.com/narphorium/status/668234313662525440?lang=en) are at the core of the tensorflow k-means implementation:\n\n```python\nexpanded_vectors = tf.expand_dims(samples, 0) #[1,n,d]\nexpanded_centroids = tf.expand_dims(centroids, 1) # [k,1,d]\ndistances = tf.reduce_sum( tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2) #[k,n]\n```\nThereby **samples** is the [*n,d*] tensor containing *n* vectors of dimension *d*.\n\n**centroids** is the [*k,d*] tensor containing *k* vectors (centroids) of dimension *d*.\n\nThe first two lines add a dimension of size 1 to samples and centroids in position 0 and 1 respectively\n\nThe third line contains the inner statement\n```python\ntf.subtract(expanded_vectors, expanded_centroids) #[k,n,d]\n```\nThis makes use of \"broadcasting\", an operation in tensorflow and numpy which aligns the shape of tensors before an arithmetic operation is applied which requires them to have the same shape. In this case the [1,n,d]-tensor **expanded_vectors** is converted to shape [*k,n,d*] by stacking *k* copies of **samples** along dimension 0. The [*k,1,d*] tensor **expanded_centroids** is converted to shape [*k,n,d*] as well by stacking *n* copies of **centroids** along dimension 1. \n\nThis makes it possible to compute the elementwise mutual distance of all *n* data vectors and all *k* centroids. Adding the outer tf.square and tf.reduce_sum operations we have one single tensor statement which does the work of 4 nested conventional loops. \n\nPure elegance! \n\nAnd ideally suited to run on a GPU (if it has enough memory, because *k\\*n\\*d* can be quite large). To not make the GPU memory a limiting factor, the implementation in kmeanstf is able to split up this tensor along the *n*-dimension into several pieces each fitting on the GPU. See parameter **max_mem** below.\n\n# class KMeansTF\n\n**Please note:** the interface is designed to be a subset of [sklearn.cluster.KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans). \nCurrent exception: parameter **max_mem** (see below)\n\n## class kmeanstf.KMeansTF(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=1e-4, verbose=0, max_mem=1300000000)\n## Parameters\n* **n_clusters : int, optional, default: 8**\n\n The number of clusters to form as well as the number of centroids to generate.\n* **init : {\u2018k-means++\u2019, \u2018random\u2019 or an ndarray}**\n\n Method for initialization, defaults to \u2018k-means++\u2019:\n\n \u2018k-means++\u2019 : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence.\n\n \u2018random\u2019: choose k observations (rows) at random from data for the initial centroids.\n\n If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.\n\n* **n_init : int, default: 10**\n\n Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.\n* **max_iter : int, default: 300**\n\n Maximum number of iterations of the k-means algorithm for a single run.\n* **tol : float, default: 1e-4**\n\n Relative tolerance with regards to inertia to declare convergence\n* **verbose : int, default: 0**\n if larger 0 outputs messages to analyze the working of the algorithm\n\n* **max_mem : int, default: 1300000000**\n (no equivalent in sklearn) size in bytes of largest tensor to be used during k-means computation. Default value is suitable for NVidia GTX-1060. If a too large value is specified and a memory error occurs, a sequence of decreasing values is automatically tried out (and printed) until a working value for the given GPU is found. This value should subsequently be used as the **\u1e3fax_mem** parameter. Large data sets ar split into fractions in order to not pass this threshold.\n\n## Attributes\n\n* **cluster_centers_ : array, [n_clusters, n_features]**\n\n Coordinates of cluster centers. \n\n* **inertia_ : float**\n\n Sum of squared distances of samples to their closest cluster center.\n* **n_iter_ : int**\n\n Number of iterations run.\n\n## Methods\n\n* **fit (self, X)**\n\n Compute k-means clustering.\n\n* **fit_predict (self, X)**\n\n Compute cluster centers and predict cluster index for each sample.\n\n **Parameters**\n X: data to be clustered and predicted, shape = [n_samples, n_features]\n\n **Returns:**\n labels : array, shape [n_samples,]\n\n Index of the cluster each sample belongs to\n\n* **predict (self, X)** \n\n Predict cluster index for each sample.\n\n **Parameters**\n X: data to be predicted, shape = [n_samples, n_features]\n\n **Returns:**\n labels : array, shape [n_samples,]\n\n Index of the cluster each sample belongs to\n# Areas for further improvements\n\n* improve the time behavior for smaller data sets\n* directly determine suitable max_mem value from physical features of present GPU\n* enable the use of several GPUs\n* enable the use of TPUs (kmeanstf does work on colab with GPU, but not yet with TPU)\n\nBoth feedback and pull requests are welcome.\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gittar/kmeanstf", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "kmeanstf", "package_url": "https://pypi.org/project/kmeanstf/", "platform": "", "project_url": "https://pypi.org/project/kmeanstf/", "project_urls": { "Homepage": "https://github.com/gittar/kmeanstf" }, "release_url": "https://pypi.org/project/kmeanstf/0.4.0/", "requires_dist": [ "tensorflow-gpu (>=1.14)", "numpy", "matplotlib", "scikit-learn" ], "requires_python": ">=3.6.0", "summary": "k-means++ implementation based on tensorflow-gpu", "version": "0.4.0" }, "last_serial": 5495503, "releases": { "0.2.1": [ { "comment_text": "", "digests": { "md5": "a661ec5894d6bee5633eddfb1a3c1fb1", "sha256": "c2c933c39d242c0116e07e9d6805b096003d66bab8cbcda83a95988c6b554426" }, "downloads": -1, "filename": "kmeanstf-0.2.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "a661ec5894d6bee5633eddfb1a3c1fb1", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 11542, "upload_time": "2019-07-05T13:15:35", "url": "https://files.pythonhosted.org/packages/1c/03/2b30e56b8b5f4735f87f041dd2dbfec3a63b7bca79abba5b5ec1c7204f81/kmeanstf-0.2.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a88e245f85855626f5ef656a105fce66", "sha256": "6310340a4af580bb2fecc15b8f172d1dc06c384186395940731eb6f207cc8a5d" }, "downloads": -1, "filename": "kmeanstf-0.2.1.tar.gz", "has_sig": false, "md5_digest": "a88e245f85855626f5ef656a105fce66", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 11429, "upload_time": "2019-07-05T13:15:37", "url": "https://files.pythonhosted.org/packages/7f/8f/ea848460a04068c4ef1154b44a4959574deee829b5d6f7cfe0ad474f75d3/kmeanstf-0.2.1.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "ee17191d569e5bd23b83dc60ec9e4d08", "sha256": "0d479fcf491b8332ccb16b194e9290db607ce465df367441757f0388d0e42630" }, "downloads": -1, "filename": "kmeanstf-0.3.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ee17191d569e5bd23b83dc60ec9e4d08", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 11809, "upload_time": "2019-07-06T18:45:16", "url": "https://files.pythonhosted.org/packages/e8/a6/a6ca8c045556cc359ddef0b367a9f5765e46e27aa1b1448765003e46db72/kmeanstf-0.3.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "49161d2266a246e1b6d8f17c8d6aa4f1", "sha256": "8c327910f5f174eb4181929e9de0a8fd09bc79329263ac53c60182b0ac16dd37" }, "downloads": -1, "filename": "kmeanstf-0.3.0.tar.gz", "has_sig": false, "md5_digest": "49161d2266a246e1b6d8f17c8d6aa4f1", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 11685, "upload_time": "2019-07-06T18:45:18", "url": "https://files.pythonhosted.org/packages/85/01/1c01ce17935b52d46404999d7caebc857230c5ba39dc5616069457d5739d/kmeanstf-0.3.0.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "d2e1551c3c8fc0973cae61bf1611bda9", "sha256": "669d6c625f49c159ab0e06e7eca7dc350d92ae37a38f7dd16f576543a130e46e" }, "downloads": -1, "filename": "kmeanstf-0.4.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d2e1551c3c8fc0973cae61bf1611bda9", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 11813, "upload_time": "2019-07-06T19:51:27", "url": "https://files.pythonhosted.org/packages/7a/27/452ddcc09d6c5804ce32e2f43aedb8e8420c8fbce521447048087ed36645/kmeanstf-0.4.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6601a49f1f8645659ebd4028e7f24948", "sha256": "4a8792ca6c20840228b5a56dc5bd092153344424e1c6cc78f880b6064221c852" }, "downloads": -1, "filename": "kmeanstf-0.4.0.tar.gz", "has_sig": false, "md5_digest": "6601a49f1f8645659ebd4028e7f24948", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 11692, "upload_time": "2019-07-06T19:51:29", "url": "https://files.pythonhosted.org/packages/24/20/0c4a2f074095f8a3c207836a54b7971ae75470b987580fd317673a563b28/kmeanstf-0.4.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d2e1551c3c8fc0973cae61bf1611bda9", "sha256": "669d6c625f49c159ab0e06e7eca7dc350d92ae37a38f7dd16f576543a130e46e" }, "downloads": -1, "filename": "kmeanstf-0.4.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d2e1551c3c8fc0973cae61bf1611bda9", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 11813, "upload_time": "2019-07-06T19:51:27", "url": "https://files.pythonhosted.org/packages/7a/27/452ddcc09d6c5804ce32e2f43aedb8e8420c8fbce521447048087ed36645/kmeanstf-0.4.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6601a49f1f8645659ebd4028e7f24948", "sha256": "4a8792ca6c20840228b5a56dc5bd092153344424e1c6cc78f880b6064221c852" }, "downloads": -1, "filename": "kmeanstf-0.4.0.tar.gz", "has_sig": false, "md5_digest": "6601a49f1f8645659ebd4028e7f24948", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 11692, "upload_time": "2019-07-06T19:51:29", "url": "https://files.pythonhosted.org/packages/24/20/0c4a2f074095f8a3c207836a54b7971ae75470b987580fd317673a563b28/kmeanstf-0.4.0.tar.gz" } ] }