{ "info": { "author": "Krisztian Szucs, Andras Fulop", "author_email": "krisztian.szucs@lensa.com, andras.fulop@lensa.com", "bugtrack_url": null, "classifiers": [], "description": "Sparkit-learn\n=============\n\n|Build Status| |PyPi| |Gitter|\n\n**PySpark + Scikit-learn = Sparkit-learn**\n\nGitHub: https://github.com/lensacom/sparkit-learn\n\nAbout\n=====\n\nSparkit-learn aims to provide scikit-learn functionality and API on\nPySpark. The main goal of the library is to create an API that stays\nclose to sklearn's.\n\nThe driving principle was to *\"Think locally, execute distributively.\"*\nTo accomodate this concept, the basic data block is always an array or a\n(sparse) matrix and the operations are executed on block level.\n\n\nRequirements\n============\n\n- **Python 2.7.x or 3.4.x**\n- **Spark[>=1.3.0]**\n- NumPy[>=1.9.0]\n- SciPy[>=0.14.0]\n- Scikit-learn[>=0.16]\n\n\n\nRun IPython from notebooks directory\n====================================\n\n.. code:: bash\n\n PYTHONPATH=${PYTHONPATH}:.. IPYTHON_OPTS=\"notebook\" ${SPARK_HOME}/bin/pyspark --master local\\[4\\] --driver-memory 2G\n\n\nRun tests with\n==============\n\n.. code:: bash\n\n ./runtests.sh\n\n\nQuick start\n===========\n\nSparkit-learn introduces three important distributed data format:\n\n- **ArrayRDD:**\n\n A *numpy.array* like distributed array\n\n .. code:: python\n\n from splearn.rdd import ArrayRDD\n\n data = range(20)\n # PySpark RDD with 2 partitions\n rdd = sc.parallelize(data, 2) # each partition with 10 elements\n # ArrayRDD\n # each partition will contain blocks with 5 elements\n X = ArrayRDD(rdd, bsize=5) # 4 blocks, 2 in each partition\n\n Basic operations:\n\n .. code:: python\n\n len(X) # 20 - number of elements in the whole dataset\n X.blocks # 4 - number of blocks\n X.shape # (20,) - the shape of the whole dataset\n\n X # returns an ArrayRDD\n # from PythonRDD...\n\n X.dtype # returns the type of the blocks\n # numpy.ndarray\n\n X.collect() # get the dataset\n # [array([0, 1, 2, 3, 4]),\n # array([5, 6, 7, 8, 9]),\n # array([10, 11, 12, 13, 14]),\n # array([15, 16, 17, 18, 19])]\n\n X[1].collect() # indexing\n # [array([5, 6, 7, 8, 9])]\n\n X[1] # also returns an ArrayRDD!\n\n X[1::2].collect() # slicing\n # [array([5, 6, 7, 8, 9]),\n # array([15, 16, 17, 18, 19])]\n\n X[1::2] # returns an ArrayRDD as well\n\n X.tolist() # returns the dataset as a list\n # [0, 1, 2, ... 17, 18, 19]\n X.toarray() # returns the dataset as a numpy.array\n # array([ 0, 1, 2, ... 17, 18, 19])\n\n # pyspark.rdd operations will still work\n X.getNumPartitions() # 2 - number of partitions\n\n\n- **SparseRDD:**\n\n The sparse counterpart of the *ArrayRDD*, the main difference is that the\n blocks are sparse matrices. The reason behind this split is to follow the\n distinction between *numpy.ndarray*s and *scipy.sparse* matrices.\n Usually the *SparseRDD* is created by *splearn*'s transformators, but one can\n instantiate too.\n\n .. 
\nRequirements\n============\n\n- **Python 2.7.x or 3.4.x**\n- **Spark[>=1.3.0]**\n- NumPy[>=1.9.0]\n- SciPy[>=0.14.0]\n- Scikit-learn[>=0.16]\n\n\nRun IPython from the notebooks directory\n========================================\n\n.. code:: bash\n\n    PYTHONPATH=${PYTHONPATH}:.. IPYTHON_OPTS=\"notebook\" ${SPARK_HOME}/bin/pyspark --master local\\[4\\] --driver-memory 2G\n\n\nRun tests with\n==============\n\n.. code:: bash\n\n    ./runtests.sh\n\n\nQuick start\n===========\n\nSparkit-learn introduces three important distributed data formats:\n\n- **ArrayRDD:**\n\n  A *numpy.array*-like distributed array.\n\n  .. code:: python\n\n      from splearn.rdd import ArrayRDD\n\n      data = range(20)\n      # PySpark RDD with 2 partitions\n      rdd = sc.parallelize(data, 2)  # each partition with 10 elements\n      # ArrayRDD\n      # each partition will contain blocks with 5 elements\n      X = ArrayRDD(rdd, bsize=5)  # 4 blocks, 2 in each partition\n\n  Basic operations:\n\n  .. code:: python\n\n      len(X)  # 20 - number of elements in the whole dataset\n      X.blocks  # 4 - number of blocks\n      X.shape  # (20,) - the shape of the whole dataset\n\n      X  # returns an ArrayRDD\n      # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...\n\n      X.dtype  # returns the type of the blocks\n      # numpy.ndarray\n\n      X.collect()  # get the dataset\n      # [array([0, 1, 2, 3, 4]),\n      #  array([5, 6, 7, 8, 9]),\n      #  array([10, 11, 12, 13, 14]),\n      #  array([15, 16, 17, 18, 19])]\n\n      X[1].collect()  # indexing\n      # [array([5, 6, 7, 8, 9])]\n\n      X[1]  # also returns an ArrayRDD!\n\n      X[1::2].collect()  # slicing\n      # [array([5, 6, 7, 8, 9]),\n      #  array([15, 16, 17, 18, 19])]\n\n      X[1::2]  # returns an ArrayRDD as well\n\n      X.tolist()  # returns the dataset as a list\n      # [0, 1, 2, ... 17, 18, 19]\n      X.toarray()  # returns the dataset as a numpy.array\n      # array([ 0, 1, 2, ... 17, 18, 19])\n\n      # pyspark.rdd operations will still work\n      X.getNumPartitions()  # 2 - number of partitions\n\n- **SparseRDD:**\n\n  The sparse counterpart of the *ArrayRDD*; the main difference is that its\n  blocks are sparse matrices. The reason behind this split is to follow the\n  distinction between *numpy.ndarray* and *scipy.sparse* matrices. Usually a\n  *SparseRDD* is created by *splearn*'s transformers, but one can also\n  instantiate one manually. (The block-wise mechanics behind reductions such\n  as *sum* are sketched right after this list.)\n\n  .. code:: python\n\n      # generate a SparseRDD from text using SparkCountVectorizer\n      from splearn.rdd import SparseRDD\n      from sklearn.feature_extraction.tests.test_text import ALL_FOOD_DOCS\n      ALL_FOOD_DOCS\n      # (u'the pizza pizza beer copyright',\n      #  u'the pizza burger beer copyright',\n      #  u'the the pizza beer beer copyright',\n      #  u'the burger beer beer copyright',\n      #  u'the coke burger coke copyright',\n      #  u'the coke burger burger',\n      #  u'the salad celeri copyright',\n      #  u'the salad salad sparkling water copyright',\n      #  u'the the celeri celeri copyright',\n      #  u'the tomato tomato salad water',\n      #  u'the tomato salad water copyright')\n\n      # ArrayRDD created from the raw data\n      X = ArrayRDD(sc.parallelize(ALL_FOOD_DOCS, 4), 2)\n      X.collect()\n      # [array([u'the pizza pizza beer copyright',\n      #         u'the pizza burger beer copyright'], dtype='<U31'),\n      #  ...]\n\n      # vectorizing the texts yields a SparseRDD\n      from splearn.feature_extraction.text import SparkCountVectorizer\n      vect = SparkCountVectorizer()\n      X = vect.fit_transform(X)\n      X\n      # <class 'splearn.rdd.SparseRDD'> from PythonRDD...\n\n      # its dtype is scipy.sparse's general parent class\n      X.dtype\n      # scipy.sparse.base.spmatrix\n\n      # slicing works just like in ArrayRDDs\n      X[2:4].collect()\n      # [<2x11 sparse matrix of type '<type 'numpy.int64'>'\n      #   with 7 stored elements in Compressed Sparse Row format>,\n      #  <2x11 sparse matrix of type '<type 'numpy.int64'>'\n      #   with 9 stored elements in Compressed Sparse Row format>]\n\n      # general mathematical operations are available\n      X.sum(), X.mean(), X.max(), X.min()\n      # (55, 0.45454545454545453, 2, 0)\n\n      # even with an axis parameter provided\n      X.sum(axis=1)\n      # matrix([[5],\n      #         [5],\n      #         [6],\n      #         [5],\n      #         [5],\n      #         [4],\n      #         [4],\n      #         [6],\n      #         [5],\n      #         [5],\n      #         [5]])\n\n      # it can be transformed to a dense ArrayRDD\n      X.todense()\n      # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...\n      X.todense().collect()\n      # [array([[1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0],\n      #         [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]]),\n      #  array([[2, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0],\n      #         [2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]]),\n      #  array([[0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0],\n      #         [0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0]]),\n      #  array([[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],\n      #         [0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 1]]),\n      #  array([[0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0],\n      #         [0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1]]),\n      #  array([[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]])]\n\n      # one can instantiate a SparseRDD manually, too:\n      import numpy as np\n      import scipy.sparse as sp\n      sparse = sc.parallelize(np.array([sp.eye(2).tocsr()] * 20), 2)\n      sparse = SparseRDD(sparse, bsize=5)\n      sparse\n      # <class 'splearn.rdd.SparseRDD'> from PythonRDD...\n\n      sparse.collect()\n      # [<10x2 sparse matrix of type '<type 'numpy.float64'>'\n      #   with 10 stored elements in Compressed Sparse Row format>,\n      #  <10x2 sparse matrix of type '<type 'numpy.float64'>'\n      #   with 10 stored elements in Compressed Sparse Row format>,\n      #  <10x2 sparse matrix of type '<type 'numpy.float64'>'\n      #   with 10 stored elements in Compressed Sparse Row format>,\n      #  <10x2 sparse matrix of type '<type 'numpy.float64'>'\n      #   with 10 stored elements in Compressed Sparse Row format>]\n\n- **DictRDD:**\n\n  A column-based data format, each column with its own type.\n\n  .. code:: python\n\n      import numpy as np\n\n      from splearn.rdd import DictRDD\n\n      X = range(20)\n      y = list(range(2)) * 10\n      # PySpark RDD with 2 partitions\n      X_rdd = sc.parallelize(X, 2)  # each partition with 10 elements\n      y_rdd = sc.parallelize(y, 2)  # each partition with 10 elements\n      # DictRDD\n      # each partition will contain blocks with 5 elements\n      Z = DictRDD((X_rdd, y_rdd),\n                  columns=('X', 'y'),\n                  bsize=5,\n                  dtype=[np.ndarray, np.ndarray])  # 4 blocks, 2/partition\n      # if no dtype is provided, the type of the blocks will be determined\n      # automatically\n\n      # or:\n      data = np.array([range(20), list(range(2)) * 10]).T\n      rdd = sc.parallelize(data, 2)\n      Z = DictRDD(rdd,\n                  columns=('X', 'y'),\n                  bsize=5,\n                  dtype=[np.ndarray, np.ndarray])\n\n  Basic operations:\n\n  .. code:: python\n\n      len(Z)  # 4 - number of blocks\n      Z.columns  # returns ('X', 'y')\n      Z.dtype  # returns the types in the correct order\n      # [numpy.ndarray, numpy.ndarray]\n\n      Z  # returns a DictRDD\n      # <class 'splearn.rdd.DictRDD'> from PythonRDD...\n\n      Z.collect()\n      # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),\n      #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),\n      #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0])),\n      #  (array([15, 16, 17, 18, 19]), array([1, 0, 1, 0, 1]))]\n\n      Z[:, 'y']  # column select - returns an ArrayRDD\n      Z[:, 'y'].collect()\n      # [array([0, 1, 0, 1, 0]),\n      #  array([1, 0, 1, 0, 1]),\n      #  array([0, 1, 0, 1, 0]),\n      #  array([1, 0, 1, 0, 1])]\n\n      Z[:-1, ['X', 'y']]  # slicing - returns a DictRDD\n      Z[:-1, ['X', 'y']].collect()\n      # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),\n      #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),\n      #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0]))]\n
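\nReductions like *X.sum()* above follow the same \"think locally, execute\ndistributively\" pattern: each block computes a local result, and the per-block\nresults are combined. A rough sketch of the idea in plain PySpark and scipy\n(an illustration, not splearn's actual code; it assumes a SparkContext *sc*):\n\n.. code:: python\n\n    import scipy.sparse as sp\n\n    # an RDD of sparse blocks, like the ones a SparseRDD holds\n    sparse_blocks = sc.parallelize([sp.eye(2).tocsr()] * 4, 2)\n\n    # \"think locally\": sum each block; \"execute distributively\": reduce\n    sparse_blocks.map(lambda block: block.sum()).reduce(lambda a, b: a + b)\n    # 8.0 - the same as summing the vertically stacked matrix locally\n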
\n\nBasic workflow\n--------------\n\nUsing the data structures described above, the basic workflow is almost\nidentical to sklearn's.\n\nDistributed vectorization of texts\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nSparkCountVectorizer\n^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkCountVectorizer\n    from sklearn.feature_extraction.text import CountVectorizer\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is a SparkContext\n\n    local = CountVectorizer()\n    dist = SparkCountVectorizer()\n\n    result_local = local.fit_transform(X)\n    result_dist = dist.fit_transform(X_rdd)  # SparseRDD\n\n\nSparkHashingVectorizer\n^^^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from sklearn.feature_extraction.text import HashingVectorizer\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is a SparkContext\n\n    local = HashingVectorizer()\n    dist = SparkHashingVectorizer()\n\n    result_local = local.fit_transform(X)\n    result_dist = dist.fit_transform(X_rdd)  # SparseRDD\n\n\nSparkTfidfTransformer\n^^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from splearn.feature_extraction.text import SparkTfidfTransformer\n    from splearn.pipeline import SparkPipeline\n\n    from sklearn.feature_extraction.text import HashingVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.pipeline import Pipeline\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is a SparkContext\n\n    local_pipeline = Pipeline((\n        ('vect', HashingVectorizer()),\n        ('tfidf', TfidfTransformer())\n    ))\n    dist_pipeline = SparkPipeline((\n        ('vect', SparkHashingVectorizer()),\n        ('tfidf', SparkTfidfTransformer())\n    ))\n\n    result_local = local_pipeline.fit_transform(X)\n    result_dist = dist_pipeline.fit_transform(X_rdd)  # SparseRDD\n
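\nSince *result_dist* is a *SparseRDD* of *scipy.sparse* blocks, it can be\ncollected and stacked to verify the distributed pipeline against the local\none. A sketch, reusing the *result_local* and *result_dist* names from the\nexample above:\n\n.. code:: python\n\n    import numpy as np\n    import scipy.sparse as sp\n\n    # collect the distributed blocks and stack them into one local matrix\n    stacked = sp.vstack(result_dist.collect())\n\n    # the distributed result should match the locally computed one\n    np.testing.assert_array_almost_equal(result_local.toarray(),\n                                         stacked.toarray())\n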
\n\nDistributed Classifiers\n~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    import numpy as np\n\n    from splearn.rdd import DictRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from splearn.feature_extraction.text import SparkTfidfTransformer\n    from splearn.svm import SparkLinearSVC\n    from splearn.pipeline import SparkPipeline\n\n    from sklearn.feature_extraction.text import HashingVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.svm import LinearSVC\n    from sklearn.pipeline import Pipeline\n\n    X = [...]  # list of texts\n    y = [...]  # list of labels\n    X_rdd = sc.parallelize(X, 4)\n    y_rdd = sc.parallelize(y, 4)\n    Z = DictRDD((X_rdd, y_rdd),\n                columns=('X', 'y'),\n                dtype=[np.ndarray, np.ndarray])\n\n    local_pipeline = Pipeline((\n        ('vect', HashingVectorizer()),\n        ('tfidf', TfidfTransformer()),\n        ('clf', LinearSVC())\n    ))\n    dist_pipeline = SparkPipeline((\n        ('vect', SparkHashingVectorizer()),\n        ('tfidf', SparkTfidfTransformer()),\n        ('clf', SparkLinearSVC())\n    ))\n\n    local_pipeline.fit(X, y)\n    dist_pipeline.fit(Z, clf__classes=np.unique(y))\n\n    y_pred_local = local_pipeline.predict(X)\n    y_pred_dist = dist_pipeline.predict(Z[:, 'X'])\n\n\nDistributed Model Selection\n~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    import numpy as np\n\n    from splearn.rdd import DictRDD\n    from splearn.grid_search import SparkGridSearchCV\n    from splearn.naive_bayes import SparkMultinomialNB\n\n    from sklearn.grid_search import GridSearchCV\n    from sklearn.naive_bayes import MultinomialNB\n\n    X = [...]\n    y = [...]\n    X_rdd = sc.parallelize(X, 4)\n    y_rdd = sc.parallelize(y, 4)\n    Z = DictRDD((X_rdd, y_rdd),\n                columns=('X', 'y'),\n                dtype=[np.ndarray, np.ndarray])\n\n    parameters = {'alpha': [0.1, 1, 10]}\n    fit_params = {'classes': np.unique(y)}\n\n    local_estimator = MultinomialNB()\n    local_grid = GridSearchCV(estimator=local_estimator,\n                              param_grid=parameters)\n\n    estimator = SparkMultinomialNB()\n    grid = SparkGridSearchCV(estimator=estimator,\n                             param_grid=parameters,\n                             fit_params=fit_params)\n\n    local_grid.fit(X, y)\n    grid.fit(Z)\n\n\nSpecial thanks\n==============\n\n- scikit-learn community\n- spylearn community\n- pyspark community\n\n\n.. |Build Status| image:: https://travis-ci.org/lensacom/sparkit-learn.png?branch=master\n :target: https://travis-ci.org/lensacom/sparkit-learn\n.. |PyPi| image:: https://img.shields.io/pypi/v/sparkit-learn.svg\n :target: https://pypi.python.org/pypi/sparkit-learn\n.. 
|Gitter| image:: https://badges.gitter.im/Join%20Chat.svg\n :alt: Join the chat at https://gitter.im/lensacom/sparkit-learn\n :target: https://gitter.im/lensacom/sparkit-learn?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/lensacom/sparkit-learn", "keywords": null, "license": "Apache License, Version 2.0", "maintainer": null, "maintainer_email": null, "name": "sparkit-learn", "package_url": "https://pypi.org/project/sparkit-learn/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/sparkit-learn/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/lensacom/sparkit-learn" }, "release_url": "https://pypi.org/project/sparkit-learn/0.2.6/", "requires_dist": null, "requires_python": null, "summary": "Scikit-learn on PySpark", "version": "0.2.6" }, "last_serial": 1604707, "releases": { "0.1.5": [ { "comment_text": "", "digests": { "md5": "48b9ab40b50d13d46fed896539deb0d5", "sha256": "30e2672713b0b7acb57699f5c46e4902ce0ef62955452d8cfd9c64fd6ba2d9a9" }, "downloads": -1, "filename": "sparkit-learn-0.1.5.macosx-10.10-intel.exe", "has_sig": false, "md5_digest": "48b9ab40b50d13d46fed896539deb0d5", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 116528, "upload_time": "2015-06-11T15:31:49", "url": "https://files.pythonhosted.org/packages/0f/ea/24fd989c913d10f0e669d51e17c28d3c0de88665399b81cbbdb9642b5934/sparkit-learn-0.1.5.macosx-10.10-intel.exe" }, { "comment_text": "", "digests": { "md5": "a49f4501f571033d04e13c022aed0cf8", "sha256": "038b4d6773d55bd46af80b4356eab7828df6b6221cf3feca2f44a37d57839c0a" }, "downloads": -1, "filename": "sparkit-learn-0.1.5.tar.gz", "has_sig": false, "md5_digest": "a49f4501f571033d04e13c022aed0cf8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 40471, "upload_time": "2015-06-11T15:31:45", "url": "https://files.pythonhosted.org/packages/4f/d4/7d40ced1364fa8c5bb863adfa2df8d399d88554bd6700480c62ae7e86cd4/sparkit-learn-0.1.5.tar.gz" } ], "0.2.3": [], "0.2.4": [ { "comment_text": "", "digests": { "md5": "04b2a6adfd354a0cebf751c635d5900e", "sha256": "0eb4c7f4937a913e6d172c725614b033ccabb325404012ce0282cbb83afb601c" }, "downloads": -1, "filename": "sparkit-learn-0.2.4.tar.gz", "has_sig": false, "md5_digest": "04b2a6adfd354a0cebf751c635d5900e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 56310, "upload_time": "2015-06-17T18:05:25", "url": "https://files.pythonhosted.org/packages/58/7f/d93248f994e2fa760526178985be9fdd28f2a9e74971ff1e4e59e96311cf/sparkit-learn-0.2.4.tar.gz" } ], "0.2.5": [ { "comment_text": "", "digests": { "md5": "1d09226869b3dc58c4496d5017d94bec", "sha256": "2236515f703186cc328fd2dc3b80f9905caf08984d6960df714e46b28b77de7e" }, "downloads": -1, "filename": "sparkit-learn-0.2.5.tar.gz", "has_sig": false, "md5_digest": "1d09226869b3dc58c4496d5017d94bec", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 57269, "upload_time": "2015-06-24T06:59:17", "url": "https://files.pythonhosted.org/packages/f8/77/66be265f9fcbb46a50a5887c8a235dd2f8cb49eb311a25725f21b7f14d49/sparkit-learn-0.2.5.tar.gz" } ], "0.2.6": [ { "comment_text": "", "digests": { "md5": "852214c341ad76fc16b9d4ddec300fd7", "sha256": "d74be2e7600087d248831964c32737b7a9b6f1e0a702a45b6d19b1f3d397bd03" }, "downloads": 
-1, "filename": "sparkit-learn-0.2.6.tar.gz", "has_sig": false, "md5_digest": "852214c341ad76fc16b9d4ddec300fd7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58394, "upload_time": "2015-06-24T13:19:40", "url": "https://files.pythonhosted.org/packages/ec/19/af25d5939ef5de3860371a4b8d09b50d3c5e1bd1329e78c6160e79fd593e/sparkit-learn-0.2.6.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "852214c341ad76fc16b9d4ddec300fd7", "sha256": "d74be2e7600087d248831964c32737b7a9b6f1e0a702a45b6d19b1f3d397bd03" }, "downloads": -1, "filename": "sparkit-learn-0.2.6.tar.gz", "has_sig": false, "md5_digest": "852214c341ad76fc16b9d4ddec300fd7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58394, "upload_time": "2015-06-24T13:19:40", "url": "https://files.pythonhosted.org/packages/ec/19/af25d5939ef5de3860371a4b8d09b50d3c5e1bd1329e78c6160e79fd593e/sparkit-learn-0.2.6.tar.gz" } ] }