{ "info": { "author": "Ibotta Inc.", "author_email": "machine_learning@ibotta.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Programming Language :: Python", "Topic :: Scientific/Engineering" ], "description": "sk-dist: Distributed scikit-learn meta-estimators in PySpark\n============================================================\n\n|License| |Build Status| |PyPI Package|\n\nWhat is it?\n-----------\n\n``sk-dist`` is a Python package for machine learning built on top of\n`scikit-learn `__ and is\ndistributed under the `Apache 2.0 software\nlicense `__. The\n``sk-dist`` module can be thought of as \"distributed scikit-learn\" as\nits core functionality is to extend the ``scikit-learn`` built-in\n``joblib`` parallelization of meta-estimator training to\n`spark `__. A popular use case is the \nparallelization of grid search.\n\nCheck out the `blog post `__ \nfor more information on the motivation and use cases of ``sk-dist``.\n\nMain Features\n-------------\n\n- **Distributed Training** - ``sk-dist`` parallelizes the training of\n ``scikit-learn`` meta-estimators with PySpark. This allows\n distributed training of these estimators without any constraint on\n the physical resources of any one machine. In all cases, spark\n artifacts are automatically stripped from the fitted estimator. These\n estimators can then be pickled and un-pickled for prediction tasks,\n operating identically at predict time to their ``scikit-learn``\n counterparts. 
Supported tasks are:\n\n - *Grid Search*: `Hyperparameter optimization\n techniques `__,\n particularly\n `GridSearchCV `__\n and\n `RandomizedSearchCV `__,\n are distributed such that each parameter set candidate is trained\n in parallel.\n - *Multiclass Strategies*: `Multiclass classification\n strategies `__,\n particularly\n `OneVsRestClassifier `__\n and\n `OneVsOneClassifier `__,\n are distributed such that each binary problem is trained in\n parallel.\n - *Tree Ensembles*: `Decision tree\n ensembles `__\n for classification and regression, particularly\n `RandomForest `__\n and\n `ExtraTrees `__,\n are distributed such that each tree is trained in parallel.\n\n- **Distributed Prediction** - ``sk-dist`` provides a prediction module\n which builds `vectorized\n UDFs `__\n for\n `PySpark `__\n `DataFrames `__\n using fitted ``scikit-learn`` estimators. This distributes the\n ``predict`` and ``predict_proba`` methods of ``scikit-learn``\n estimators, enabling large scale prediction with ``scikit-learn``.\n- **Feature Encoding** - ``sk-dist`` provides a flexible feature\n encoding utility called ``Encoderizer`` which encodes mixed-type\n feature spaces using either default behavior or user-defined\n customizable settings. It is particularly aimed at text features, but\n it additionally handles numeric and dictionary type feature spaces.\n\nInstallation\n------------\n\nDependencies\n~~~~~~~~~~~~\n\n``sk-dist`` requires:\n\n- `Python `__ (>= 3.5)\n- `pandas `__ (>=0.19.0)\n- `numpy `__ (>=1.17.0)\n- `scipy `__ (>=1.3.1)\n- `scikit-learn `__ (>=0.21.3)\n- `joblib `__ (>=0.11)\n\nsk-dist does not support Python 2.\n\nSpark Dependencies\n~~~~~~~~~~~~~~~~~~\n\nMost ``sk-dist`` functionality requires a spark installation as well as\nPySpark. Some functionality can run without spark, so spark related\ndependencies are not required. 
The connection between sk-dist and spark\nrelies solely on a ``sparkContext`` as an argument to various\n``sk-dist`` classes upon instantiation.\n\nA variety of spark configurations and setups will work. It is left up to\nthe user to configure their own spark setup. The testing suite runs\n``spark 2.3`` and ``spark 2.4``, though any ``spark 2.0+`` versions \nare expected to work.\n\nAn additional spark related dependency is ``pyarrow``, which is used only\nfor ``skdist.predict`` functions. These use vectorized pandas UDFs, which\nrequire ``pyarrow>=0.8.0``. Depending on the spark version, it may be\nnecessary to set\n``spark.conf.set(\"spark.sql.execution.arrow.enabled\", \"true\")`` in the\nspark configuration.\n\nUser Installation\n~~~~~~~~~~~~~~~~~\n\nThe easiest way to install ``sk-dist`` is with ``pip``:\n\n::\n\n pip install --upgrade sk-dist\n\nYou can also download the source code:\n\n::\n\n git clone https://github.com/Ibotta/sk-dist.git\n\nTesting\n~~~~~~~\n\nWith ``pytest`` installed, you can run tests locally:\n\n::\n\n pytest sk-dist\n\nExamples\n--------\n\nThe package contains numerous \n`examples `__ \non how to use ``sk-dist`` in practice. Examples of note are:\n\n- `Grid Search with XGBoost `__\n- `Spark ML Benchmark Comparison `__\n- `Encoderizer with 20 Newsgroups `__\n- `One-Vs-Rest vs One-Vs-One `__\n- `Large Scale Sklearn Prediction with PySpark UDFs `_\n\nGradient Boosting\n-----------------\n\n``sk-dist`` has been tested with a number of popular gradient boosting packages that conform to the ``scikit-learn`` API. This \nincludes ``xgboost`` and ``catboost``. These will need to be installed in addition to ``sk-dist`` on all nodes of the spark \ncluster via a node bootstrap script. Version compatibility is left up to the user.\n\nSupport for ``lightgbm`` is not guaranteed, as it requires `additional installations `__ on all \nnodes of the spark cluster. 
This may work given proper installation but has not been tested with ``sk-dist``.\n\nBackground\n----------\n\nThe project was started at `Ibotta\nInc. `__ on the machine learning\nteam and open sourced in 2019.\n\nIt is currently maintained by the machine learning team at Ibotta. Special\nthanks to those who contributed to ``sk-dist`` while it was initially\nin development at Ibotta:\n\n- `Evan Harris `__\n- `Nicole Woytarowicz `__\n- `Mike Lewis `__\n- `Bobby Crimi `__\n\n\n.. |License| image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg\n :target: https://opensource.org/licenses/Apache-2.0\n.. |Build Status| image:: https://travis-ci.org/Ibotta/sk-dist.png?branch=master\n :target: https://travis-ci.org/Ibotta/sk-dist\n.. |PyPI Package| image:: https://badge.fury.io/py/sk-dist.svg\n :target: https://pypi.org/project/sk-dist/", "description_content_type": "", "docs_url": null, "download_url": "https://pypi.org/project/sk-dist/#files", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "", "keywords": "", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "sk-dist", "package_url": "https://pypi.org/project/sk-dist/", "platform": "", "project_url": "https://pypi.org/project/sk-dist/", "project_urls": { "Download": "https://pypi.org/project/sk-dist/#files", "Source Code": "https://github.com/Ibotta/sk-dist" }, "release_url": "https://pypi.org/project/sk-dist/0.1.5/", "requires_dist": null, "requires_python": ">=3.5", "summary": "Distributed scikit-learn meta-estimators with PySpark", "version": "0.1.5" }, "last_serial": 5951568, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "d535c9583b4302a413c4c0b1833fa734", "sha256": "57fcc269f71263dca7053232583f802893ef838dc45c5584cefd091ef470ca04" }, "downloads": -1, "filename": "sk-dist-0.1.0.tar.gz", "has_sig": false, "md5_digest": "d535c9583b4302a413c4c0b1833fa734", "packagetype": "sdist", "python_version": "source", "requires_python": 
">=3.5", "size": 30510, "upload_time": "2019-08-28T15:51:50", "url": "https://files.pythonhosted.org/packages/de/42/55b1347436452cf2bb68c2fa411aa346f40323de8e4f2b3bdda3b4b614bd/sk-dist-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "d967dcbc6bdca4d561b97cb55b9a2e4f", "sha256": "df75389b9dd06ec2f6384073e27e19b0e372be6f2a1e96d607ba9e73ddedb908" }, "downloads": -1, "filename": "sk-dist-0.1.1.tar.gz", "has_sig": false, "md5_digest": "d967dcbc6bdca4d561b97cb55b9a2e4f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 33186, "upload_time": "2019-09-16T16:05:28", "url": "https://files.pythonhosted.org/packages/85/29/b4b6e3c4008db35e966b91002686c63022f610aa2078f459dfa02522a132/sk-dist-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "e01317ba15977a3850fec387cb7a266c", "sha256": "6dc9d049c7cc598a89ae505c1aa2ad5e6759fd4fbb3944856188a14d40bed553" }, "downloads": -1, "filename": "sk-dist-0.1.2.tar.gz", "has_sig": false, "md5_digest": "e01317ba15977a3850fec387cb7a266c", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 34482, "upload_time": "2019-09-17T22:47:30", "url": "https://files.pythonhosted.org/packages/aa/aa/c691edb73201430bb1a4f371e0dcbb7492fea788ce5e9a7f62a5a59b24ff/sk-dist-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "93bad74aa9d11bf9919c9a6bda494354", "sha256": "bed78afd84ce81238fb35d28db367c9a0df661e0413cb448cf87550b354517d4" }, "downloads": -1, "filename": "sk-dist-0.1.3.tar.gz", "has_sig": false, "md5_digest": "93bad74aa9d11bf9919c9a6bda494354", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 38544, "upload_time": "2019-10-07T16:57:30", "url": "https://files.pythonhosted.org/packages/b4/a9/3dd403721701aa37550cf619f4cedfaa3da87bcb8abe00cb40050ecdd24a/sk-dist-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "9e6b986f928b7cca101c33a3a62d3518", "sha256": 
"2ac7db394e1ff5a405a2933104d2d5376b7990fd937503970074a1596bacca73" }, "downloads": -1, "filename": "sk-dist-0.1.4.tar.gz", "has_sig": false, "md5_digest": "9e6b986f928b7cca101c33a3a62d3518", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 38740, "upload_time": "2019-10-07T21:46:23", "url": "https://files.pythonhosted.org/packages/dc/e5/73f91dc489b7b1336c00bcca25f8a4b3e8e41457b6ddf61671f734e52875/sk-dist-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "5c5db9ac4d460a2dd6b817c5aaa804b5", "sha256": "c642347a4fb9b941a45aa97b05fafe7fbf2abf79937d2e19564c094923610cdf" }, "downloads": -1, "filename": "sk-dist-0.1.5.tar.gz", "has_sig": false, "md5_digest": "5c5db9ac4d460a2dd6b817c5aaa804b5", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 40849, "upload_time": "2019-10-09T19:21:47", "url": "https://files.pythonhosted.org/packages/be/4d/0d9ffb4ce3b43de3f0db2fc03c7e8d664285f537d4b657c86805aef8578d/sk-dist-0.1.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5c5db9ac4d460a2dd6b817c5aaa804b5", "sha256": "c642347a4fb9b941a45aa97b05fafe7fbf2abf79937d2e19564c094923610cdf" }, "downloads": -1, "filename": "sk-dist-0.1.5.tar.gz", "has_sig": false, "md5_digest": "5c5db9ac4d460a2dd6b817c5aaa804b5", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 40849, "upload_time": "2019-10-09T19:21:47", "url": "https://files.pythonhosted.org/packages/be/4d/0d9ffb4ce3b43de3f0db2fc03c7e8d664285f537d4b657c86805aef8578d/sk-dist-0.1.5.tar.gz" } ] }