{
    "info": {
        "author": "Krisztian Szucs, Andras Fulop",
        "author_email": "krisztian.szucs@lensa.com, andras.fulop@lensa.com",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "Sparkit-learn\n=============\n\n|Build Status| |PyPi| |Gitter|\n\n**PySpark + Scikit-learn = Sparkit-learn**\n\nGitHub: https://github.com/lensacom/sparkit-learn\n\nAbout\n=====\n\nSparkit-learn aims to provide scikit-learn functionality and API on\nPySpark. The main goal of the library is to create an API that stays\nclose to sklearn's.\n\nThe driving principle was to *\"Think locally, execute distributively.\"*\nTo accomodate this concept, the basic data block is always an array or a\n(sparse) matrix and the operations are executed on block level.\n\n\nRequirements\n============\n\n-  **Python 2.7.x or 3.4.x**\n-  **Spark[>=1.3.0]**\n-  NumPy[>=1.9.0]\n-  SciPy[>=0.14.0]\n-  Scikit-learn[>=0.16]\n\n\n\nRun IPython from notebooks directory\n====================================\n\n.. code:: bash\n\n    PYTHONPATH=${PYTHONPATH}:.. IPYTHON_OPTS=\"notebook\" ${SPARK_HOME}/bin/pyspark --master local\\[4\\] --driver-memory 2G\n\n\nRun tests with\n==============\n\n.. code:: bash\n\n    ./runtests.sh\n\n\nQuick start\n===========\n\nSparkit-learn introduces three important distributed data format:\n\n-  **ArrayRDD:**\n\n   A *numpy.array* like distributed array\n\n   .. code:: python\n\n       from splearn.rdd import ArrayRDD\n\n       data = range(20)\n       # PySpark RDD with 2 partitions\n       rdd = sc.parallelize(data, 2) # each partition with 10 elements\n       # ArrayRDD\n       # each partition will contain blocks with 5 elements\n       X = ArrayRDD(rdd, bsize=5) # 4 blocks, 2 in each partition\n\n   Basic operations:\n\n   .. code:: python\n\n       len(X) # 20 - number of elements in the whole dataset\n       X.blocks # 4 - number of blocks\n       X.shape # (20,) - the shape of the whole dataset\n\n       X # returns an ArrayRDD\n       # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...\n\n       X.dtype # returns the type of the blocks\n       # numpy.ndarray\n\n       X.collect() # get the dataset\n       # [array([0, 1, 2, 3, 4]),\n       #  array([5, 6, 7, 8, 9]),\n       #  array([10, 11, 12, 13, 14]),\n       #  array([15, 16, 17, 18, 19])]\n\n       X[1].collect() # indexing\n       # [array([5, 6, 7, 8, 9])]\n\n       X[1] # also returns an ArrayRDD!\n\n       X[1::2].collect() # slicing\n       # [array([5, 6, 7, 8, 9]),\n       #  array([15, 16, 17, 18, 19])]\n\n       X[1::2] # returns an ArrayRDD as well\n\n       X.tolist() # returns the dataset as a list\n       # [0, 1, 2, ... 17, 18, 19]\n       X.toarray() # returns the dataset as a numpy.array\n       # array([ 0,  1,  2, ... 17, 18, 19])\n\n       # pyspark.rdd operations will still work\n       X.getNumPartitions() # 2 - number of partitions\n\n\n- **SparseRDD:**\n\n  The sparse counterpart of the *ArrayRDD*, the main difference is that the\n  blocks are sparse matrices. The reason behind this split is to follow the\n  distinction between *numpy.ndarray*s and *scipy.sparse* matrices.\n  Usually the *SparseRDD* is created by *splearn*'s transformators, but one can\n  instantiate too.\n\n  .. code:: python\n\n       # generate a SparseRDD from a text using SparkCountVectorizer\n       from splearn.rdd import SparseRDD\n       from sklearn.feature_extraction.tests.test_text import ALL_FOOD_DOCS\n       ALL_FOOD_DOCS\n       #(u'the pizza pizza beer copyright',\n       # u'the pizza burger beer copyright',\n       # u'the the pizza beer beer copyright',\n       # u'the burger beer beer copyright',\n       # u'the coke burger coke copyright',\n       # u'the coke burger burger',\n       # u'the salad celeri copyright',\n       # u'the salad salad sparkling water copyright',\n       # u'the the celeri celeri copyright',\n       # u'the tomato tomato salad water',\n       # u'the tomato salad water copyright')\n\n       # ArrayRDD created from the raw data\n       X = ArrayRDD(sc.parallelize(ALL_FOOD_DOCS, 4), 2)\n       X.collect()\n       # [array([u'the pizza pizza beer copyright',\n       #         u'the pizza burger beer copyright'], dtype='<U31'),\n       #  array([u'the the pizza beer beer copyright',\n       #         u'the burger beer beer copyright'], dtype='<U33'),\n       #  array([u'the coke burger coke copyright',\n       #         u'the coke burger burger'], dtype='<U30'),\n       #  array([u'the salad celeri copyright',\n       #         u'the salad salad sparkling water copyright'], dtype='<U41'),\n       #  array([u'the the celeri celeri copyright',\n       #         u'the tomato tomato salad water'], dtype='<U31'),\n       #  array([u'the tomato salad water copyright'], dtype='<U32')]\n\n       # Feature extraction executed\n       from splearn.feature_extraction.text import SparkCountVectorizer\n       vect = SparkCountVectorizer()\n       X = vect.fit_transform(X)\n       # and we have a SparseRDD\n       X\n       # <class 'splearn.rdd.SparseRDD'> from PythonRDD...\n\n       # it's type is the scipy.sparse's general parent\n       X.dtype\n       # scipy.sparse.base.spmatrix\n\n       # slicing works just like in ArrayRDDs\n       X[2:4].collect()\n       # [<2x11 sparse matrix of type '<type 'numpy.int64'>'\n       #   with 7 stored elements in Compressed Sparse Row format>,\n       #  <2x11 sparse matrix of type '<type 'numpy.int64'>'\n       #   with 9 stored elements in Compressed Sparse Row format>]\n\n       # general mathematical operations are available\n       X.sum(), X.mean(), X.max(), X.min()\n       # (55, 0.45454545454545453, 2, 0)\n\n       # even with axis parameters provided\n       X.sum(axis=1)\n       # matrix([[5],\n       #         [5],\n       #         [6],\n       #         [5],\n       #         [5],\n       #         [4],\n       #         [4],\n       #         [6],\n       #         [5],\n       #         [5],\n       #         [5]])\n\n       # It can be transformed to dense ArrayRDD\n       X.todense()\n       # <class 'splearn.rdd.ArrayRDD'> from PythonRDD...\n       X.todense().collect()\n       # [array([[1, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0],\n       #         [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]]),\n       #  array([[2, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0],\n       #         [2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]]),\n       #  array([[0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0],\n       #         [0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0]]),\n       #  array([[0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0],\n       #         [0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 1]]),\n       #  array([[0, 0, 2, 0, 1, 0, 0, 0, 2, 0, 0],\n       #         [0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 1]]),\n       #  array([[0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]])]\n\n       # One can instantiate SparseRDD manually too:\n       sparse = sc.parallelize(np.array([sp.eye(2).tocsr()]*20), 2)\n       sparse = SparseRDD(sparse, bsize=5)\n       sparse\n       # <class 'splearn.rdd.SparseRDD'> from PythonRDD...\n\n       sparse.collect()\n       # [<10x2 sparse matrix of type '<type 'numpy.float64'>'\n       #   with 10 stored elements in Compressed Sparse Row format>,\n       #  <10x2 sparse matrix of type '<type 'numpy.float64'>'\n       #   with 10 stored elements in Compressed Sparse Row format>,\n       #  <10x2 sparse matrix of type '<type 'numpy.float64'>'\n       #   with 10 stored elements in Compressed Sparse Row format>,\n       #  <10x2 sparse matrix of type '<type 'numpy.float64'>'\n       #   with 10 stored elements in Compressed Sparse Row format>]\n\n\n-  **DictRDD:**\n\n   A column based data format, each column with it's own type.\n\n   .. code:: python\n\n       from splearn.rdd import DictRDD\n\n       X = range(20)\n       y = list(range(2)) * 10\n       # PySpark RDD with 2 partitions\n       X_rdd = sc.parallelize(X, 2) # each partition with 10 elements\n       y_rdd = sc.parallelize(y, 2) # each partition with 10 elements\n       # DictRDD\n       # each partition will contain blocks with 5 elements\n       Z = DictRDD((X_rdd, y_rdd),\n                   columns=('X', 'y'),\n                   bsize=5,\n                   dtype=[np.ndarray, np.ndarray]) # 4 blocks, 2/partition\n       # if no dtype is provided, the type of the blocks will be determined\n       # automatically\n\n       # or:\n       import numpy as np\n\n       data = np.array([range(20), list(range(2))*10]).T\n       rdd = sc.parallelize(data, 2)\n       Z = DictRDD(rdd,\n                   columns=('X', 'y'),\n                   bsize=5,\n                   dtype=[np.ndarray, np.ndarray])\n\n   Basic operations:\n\n   .. code:: python\n\n       len(Z) # 8 - number of blocks\n       Z.columns # returns ('X', 'y')\n       Z.dtype # returns the types in correct order\n       # [numpy.ndarray, numpy.ndarray]\n\n       Z # returns a DictRDD\n       #<class 'splearn.rdd.DictRDD'> from PythonRDD...\n\n       Z.collect()\n       # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),\n       #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),\n       #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0])),\n       #  (array([15, 16, 17, 18, 19]), array([1, 0, 1, 0, 1]))]\n\n       Z[:, 'y'] # column select - returns an ArrayRDD\n       Z[:, 'y'].collect()\n       # [array([0, 1, 0, 1, 0]),\n       #  array([1, 0, 1, 0, 1]),\n       #  array([0, 1, 0, 1, 0]),\n       #  array([1, 0, 1, 0, 1])]\n\n       Z[:-1, ['X', 'y']] # slicing - DictRDD\n       Z[:-1, ['X', 'y']].collect()\n       # [(array([0, 1, 2, 3, 4]), array([0, 1, 0, 1, 0])),\n       #  (array([5, 6, 7, 8, 9]), array([1, 0, 1, 0, 1])),\n       #  (array([10, 11, 12, 13, 14]), array([0, 1, 0, 1, 0]))]\n\n\nBasic workflow\n--------------\n\nWith the use of the described data structures, the basic workflow is\nalmost identical to sklearn's.\n\nDistributed vectorizing of texts\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nSparkCountVectorizer\n^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkCountVectorizer\n    from sklearn.feature_extraction.text import CountVectorizer\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext\n\n    local = CountVectorizer()\n    dist = SparkCountVectorizer()\n\n    result_local = local.fit_transform(X)\n    result_dist = dist.fit_transform(X_rdd)  # SparseRDD\n\n\nSparkHashingVectorizer\n^^^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from sklearn.feature_extraction.text import HashingVectorizer\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext\n\n    local = HashingVectorizer()\n    dist = SparkHashingVectorizer()\n\n    result_local = local.fit_transform(X)\n    result_dist = dist.fit_transform(X_rdd)  # SparseRDD\n\n\nSparkTfidfTransformer\n^^^^^^^^^^^^^^^^^^^^^\n\n.. code:: python\n\n    from splearn.rdd import ArrayRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from splearn.feature_extraction.text import SparkTfidfTransformer\n    from splearn.pipeline import SparkPipeline\n\n    from sklearn.feature_extraction.text import HashingVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.pipeline import Pipeline\n\n    X = [...]  # list of texts\n    X_rdd = ArrayRDD(sc.parallelize(X, 4))  # sc is SparkContext\n\n    local_pipeline = Pipeline((\n        ('vect', HashingVectorizer()),\n        ('tfidf', TfidfTransformer())\n    ))\n    dist_pipeline = SparkPipeline((\n        ('vect', SparkHashingVectorizer()),\n        ('tfidf', SparkTfidfTransformer())\n    ))\n\n    result_local = local_pipeline.fit_transform(X)\n    result_dist = dist_pipeline.fit_transform(X_rdd)  # SparseRDD\n\n\nDistributed Classifiers\n~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    from splearn.rdd import DictRDD\n    from splearn.feature_extraction.text import SparkHashingVectorizer\n    from splearn.feature_extraction.text import SparkTfidfTransformer\n    from splearn.svm import SparkLinearSVC\n    from splearn.pipeline import SparkPipeline\n\n    from sklearn.feature_extraction.text import HashingVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.svm import LinearSVC\n    from sklearn.pipeline import Pipeline\n\n    X = [...]  # list of texts\n    y = [...]  # list of labels\n    X_rdd = sc.parallelize(X, 4)\n    y_rdd = sc.parralelize(y, 4)\n    Z = DictRDD((X_rdd, y_rdd),\n                columns=('X', 'y'),\n                dtype=[np.ndarray, np.ndarray])\n\n    local_pipeline = Pipeline((\n        ('vect', HashingVectorizer()),\n        ('tfidf', TfidfTransformer()),\n        ('clf', LinearSVC())\n    ))\n    dist_pipeline = SparkPipeline((\n        ('vect', SparkHashingVectorizer()),\n        ('tfidf', SparkTfidfTransformer()),\n        ('clf', SparkLinearSVC())\n    ))\n\n    local_pipeline.fit(X, y)\n    dist_pipeline.fit(Z, clf__classes=np.unique(y))\n\n    y_pred_local = local_pipeline.predict(X)\n    y_pred_dist = dist_pipeline.predict(Z[:, 'X'])\n\n\nDistributed Model Selection\n~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    from splearn.rdd import DictRDD\n    from splearn.grid_search import SparkGridSearchCV\n    from splearn.naive_bayes import SparkMultinomialNB\n\n    from sklearn.grid_search import GridSearchCV\n    from sklearn.naive_bayes import MultinomialNB\n\n    X = [...]\n    y = [...]\n    X_rdd = sc.parallelize(X, 4)\n    y_rdd = sc.parralelize(y, 4)\n    Z = DictRDD((X_rdd, y_rdd),\n                columns=('X', 'y'),\n                dtype=[np.ndarray, np.ndarray])\n\n    parameters = {'alpha': [0.1, 1, 10]}\n    fit_params = {'classes': np.unique(y)}\n\n    local_estimator = MultinomialNB()\n    local_grid = GridSearchCV(estimator=local_estimator,\n                              param_grid=parameters)\n\n    estimator = SparkMultinomialNB()\n    grid = SparkGridSearchCV(estimator=estimator,\n                             param_grid=parameters,\n                             fit_params=fit_params)\n\n    local_grid.fit(X, y)\n    grid.fit(Z)\n\n\nSpecial thanks\n==============\n\n- scikit-learn community\n- spylearn community\n- pyspark community\n\n\n.. |Build Status| image:: https://travis-ci.org/lensacom/sparkit-learn.png?branch=master\n   :target: https://travis-ci.org/lensacom/sparkit-learn\n.. |PyPi| image:: https://img.shields.io/pypi/v/sparkit-learn.svg\n   :target: https://pypi.python.org/pypi/sparkit-learn\n.. |Gitter| image:: https://badges.gitter.im/Join%20Chat.svg\n   :alt: Join the chat at https://gitter.im/lensacom/sparkit-learn\n   :target: https://gitter.im/lensacom/sparkit-learn?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/lensacom/sparkit-learn",
        "keywords": null,
        "license": "Apache License, Version 2.0",
        "maintainer": null,
        "maintainer_email": null,
        "name": "sparkit-learn",
        "package_url": "https://pypi.org/project/sparkit-learn/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/sparkit-learn/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "https://github.com/lensacom/sparkit-learn"
        },
        "release_url": "https://pypi.org/project/sparkit-learn/0.2.6/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "Scikit-learn on PySpark",
        "version": "0.2.6"
    },
    "last_serial": 1604707,
    "releases": {
        "0.1.5": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "48b9ab40b50d13d46fed896539deb0d5",
                    "sha256": "30e2672713b0b7acb57699f5c46e4902ce0ef62955452d8cfd9c64fd6ba2d9a9"
                },
                "downloads": -1,
                "filename": "sparkit-learn-0.1.5.macosx-10.10-intel.exe",
                "has_sig": false,
                "md5_digest": "48b9ab40b50d13d46fed896539deb0d5",
                "packagetype": "bdist_wininst",
                "python_version": "any",
                "requires_python": null,
                "size": 116528,
                "upload_time": "2015-06-11T15:31:49",
                "url": "https://files.pythonhosted.org/packages/0f/ea/24fd989c913d10f0e669d51e17c28d3c0de88665399b81cbbdb9642b5934/sparkit-learn-0.1.5.macosx-10.10-intel.exe"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "a49f4501f571033d04e13c022aed0cf8",
                    "sha256": "038b4d6773d55bd46af80b4356eab7828df6b6221cf3feca2f44a37d57839c0a"
                },
                "downloads": -1,
                "filename": "sparkit-learn-0.1.5.tar.gz",
                "has_sig": false,
                "md5_digest": "a49f4501f571033d04e13c022aed0cf8",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 40471,
                "upload_time": "2015-06-11T15:31:45",
                "url": "https://files.pythonhosted.org/packages/4f/d4/7d40ced1364fa8c5bb863adfa2df8d399d88554bd6700480c62ae7e86cd4/sparkit-learn-0.1.5.tar.gz"
            }
        ],
        "0.2.3": [],
        "0.2.4": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "04b2a6adfd354a0cebf751c635d5900e",
                    "sha256": "0eb4c7f4937a913e6d172c725614b033ccabb325404012ce0282cbb83afb601c"
                },
                "downloads": -1,
                "filename": "sparkit-learn-0.2.4.tar.gz",
                "has_sig": false,
                "md5_digest": "04b2a6adfd354a0cebf751c635d5900e",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 56310,
                "upload_time": "2015-06-17T18:05:25",
                "url": "https://files.pythonhosted.org/packages/58/7f/d93248f994e2fa760526178985be9fdd28f2a9e74971ff1e4e59e96311cf/sparkit-learn-0.2.4.tar.gz"
            }
        ],
        "0.2.5": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "1d09226869b3dc58c4496d5017d94bec",
                    "sha256": "2236515f703186cc328fd2dc3b80f9905caf08984d6960df714e46b28b77de7e"
                },
                "downloads": -1,
                "filename": "sparkit-learn-0.2.5.tar.gz",
                "has_sig": false,
                "md5_digest": "1d09226869b3dc58c4496d5017d94bec",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 57269,
                "upload_time": "2015-06-24T06:59:17",
                "url": "https://files.pythonhosted.org/packages/f8/77/66be265f9fcbb46a50a5887c8a235dd2f8cb49eb311a25725f21b7f14d49/sparkit-learn-0.2.5.tar.gz"
            }
        ],
        "0.2.6": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "852214c341ad76fc16b9d4ddec300fd7",
                    "sha256": "d74be2e7600087d248831964c32737b7a9b6f1e0a702a45b6d19b1f3d397bd03"
                },
                "downloads": -1,
                "filename": "sparkit-learn-0.2.6.tar.gz",
                "has_sig": false,
                "md5_digest": "852214c341ad76fc16b9d4ddec300fd7",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 58394,
                "upload_time": "2015-06-24T13:19:40",
                "url": "https://files.pythonhosted.org/packages/ec/19/af25d5939ef5de3860371a4b8d09b50d3c5e1bd1329e78c6160e79fd593e/sparkit-learn-0.2.6.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "852214c341ad76fc16b9d4ddec300fd7",
                "sha256": "d74be2e7600087d248831964c32737b7a9b6f1e0a702a45b6d19b1f3d397bd03"
            },
            "downloads": -1,
            "filename": "sparkit-learn-0.2.6.tar.gz",
            "has_sig": false,
            "md5_digest": "852214c341ad76fc16b9d4ddec300fd7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 58394,
            "upload_time": "2015-06-24T13:19:40",
            "url": "https://files.pythonhosted.org/packages/ec/19/af25d5939ef5de3860371a4b8d09b50d3c5e1bd1329e78c6160e79fd593e/sparkit-learn-0.2.6.tar.gz"
        }
    ]
}