{ "info": { "author": "Benjamin Habert", "author_email": "bhabert@quantmetry.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# pipeasy-spark\n\n\n[![pypi badge](https://img.shields.io/pypi/v/pipeasy_spark.svg)](https://pypi.python.org/pypi/pipeasy_spark)\n[![](https://img.shields.io/travis/Quantmetry/pipeasy-spark.svg)](https://travis-ci.org/Quantmetry/pipeasy-spark)\n[![documentation badge](https://readthedocs.org/projects/pipeasy-spark/badge/?version=latest)](https://readthedocs.org/projects/pipeasy-spark/)\n[![pyup badge](https://pyup.io/repos/github/BenjaminHabert/pipeasy_spark/shield.svg)](https://pyup.io/repos/github/BenjaminHabert/pipeasy_spark/)\n\n\nan easy way to define preprocessing data pipelines for pysparark\n\n\n* Free software: MIT license\n* Documentation: https://pipeasy-spark.readthedocs.io.\n\nVerified compatibility:\n- python 3.5, 3.6\n- pyspark 2.1 -> 2.4\n\n# Documentation\n\nhttps://pipeasy-spark.readthedocs.io/en/latest/\n\n# Installation\n\n\n```\npip install pipeasy-spark\n```\n\n# Usage\n\nThe goal of this package is to easily create a `pyspark.ml.Pipeline` instance that can\ntransform the columns of a `pyspark.sql.Dataframe` in order to prepare it for a regressor or classifier.\n\n\nAssuming we have the `titanic` dataset as a Dataframe:\n\n```python\ndf = titanic.select('Survived', 'Sex', 'Age').dropna()\ndf.show(2)\n# +--------+------+----+\n# |Survived| Sex| Age|\n# +--------+------+----+\n# | 0| male|22.0|\n# | 1|female|38.0|\n# +--------+------+----+\n```\n\nA basic transformation pipeline can be created as follows. We define for each\ncolumn of the dataframe a list of transformers that are applied sequencially.\nEach transformer is an instance of a transformer from `pyspark.ml.feature`.\nNotice that we do not provide the parameters `inputCol` or `outputCol` to\nthese transformers.\n\n```python\nfrom pipeasy_spark import build_pipeline\nfrom pyspark.ml.feature import (\n StringIndexer,\n OneHotEncoderEstimator,\n VectorAssembler,\n StandardScaler,\n)\n\npipeline = build_pipeline(column_transformers={\n # 'Survived' : this variable is not modified, it can also be omitted from the dict\n 'Survived': [],\n 'Sex': [StringIndexer(), OneHotEncoderEstimator(dropLast=False)],\n # 'Age': a VectorAssembler must be applied before the StandardScaler\n # as the latter only accepts vectors as input.\n 'Age': [VectorAssembler(), StandardScaler()]\n})\ntrained_pipeline = pipeline.fit(df)\ntrained_pipeline.transform(df).show(2)\n# +--------+-------------+--------------------+\n# |Survived| Sex| Age|\n# +--------+-------------+--------------------+\n# | 0|(2,[0],[1.0])|[1.5054181442954726]|\n# | 1|(2,[1],[1.0])| [2.600267703783089]|\n# +--------+-------------+--------------------+\n```\n\nThis preprocessing pipeline can be included in a full prediction pipeline :\n\n```python\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.classification import LogisticRegression\n\nfull_pipeline = Pipeline(stages=[\n pipeline,\n # all the features have to be assembled in a single column:\n VectorAssembler(inputCols=['Sex', 'Age'], outputCol='features'),\n LogisticRegression(featuresCol='features', labelCol='Survived')\n])\n\ntrained_predictor = full_pipeline.fit(df)\ntrained_predictor.transform(df).show(2)\n# +--------+-------------+--------------------+--------------------+--------------------+--------------------+----------+\n# |Survived| Sex| Age| features| rawPrediction| probability|prediction|\n# +--------+-------------+--------------------+--------------------+--------------------+--------------------+----------+\n# | 0|(2,[0],[1.0])|[1.5054181442954726]|[1.0,0.0,1.505418...|[2.03811507112527...|[0.88474119316485...| 0.0|\n# | 1|(2,[1],[1.0])| [2.600267703783089]|[0.0,1.0,2.600267...|[-0.7360149659890...|[0.32387617489886...| 1.0|\n# +--------+-------------+--------------------+--------------------+--------------------+--------------------+----------+\n```\n\nAs of this writing, these are the transformers from `pyspark.ml.feature` that are supported:\n\n```python\n[\n Binarizer(threshold=0.0, inputCol=None, outputCol=None),\n BucketedRandomProjectionLSH(inputCol=None, outputCol=None, seed=None, numHashTables=1, bucketLength=None),\n Bucketizer(splits=None, inputCol=None, outputCol=None, handleInvalid='error'),\n CountVectorizer(minTF=1.0, minDF=1.0, vocabSize=262144, binary=False, inputCol=None, outputCol=None),\n DCT(inverse=False, inputCol=None, outputCol=None),\n ElementwiseProduct(scalingVec=None, inputCol=None, outputCol=None),\n FeatureHasher(numFeatures=262144, inputCols=None, outputCol=None, categoricalCols=None),\n HashingTF(numFeatures=262144, binary=False, inputCol=None, outputCol=None),\n IDF(minDocFreq=0, inputCol=None, outputCol=None),\n Imputer(strategy='mean', missingValue=nan, inputCols=None, outputCols=None),\n IndexToString(inputCol=None, outputCol=None, labels=None),\n MaxAbsScaler(inputCol=None, outputCol=None),\n MinHashLSH(inputCol=None, outputCol=None, seed=None, numHashTables=1),\n MinMaxScaler(min=0.0, max=1.0, inputCol=None, outputCol=None),\n NGram(n=2, inputCol=None, outputCol=None),\n Normalizer(p=2.0, inputCol=None, outputCol=None),\n OneHotEncoder(dropLast=True, inputCol=None, outputCol=None),\n OneHotEncoderEstimator(inputCols=None, outputCols=None, handleInvalid='error', dropLast=True),\n PCA(k=None, inputCol=None, outputCol=None),\n PolynomialExpansion(degree=2, inputCol=None, outputCol=None),\n QuantileDiscretizer(numBuckets=2, inputCol=None, outputCol=None, relativeError=0.001, handleInvalid='error'),\n RegexTokenizer(minTokenLength=1, gaps=True, pattern='\\\\\\\\s+', inputCol=None, outputCol=None, toLowercase=True),\n StandardScaler(withMean=False, withStd=True, inputCol=None, outputCol=None),\n StopWordsRemover(inputCol=None, outputCol=None, stopWords=None, caseSensitive=False),\n StringIndexer(inputCol=None, outputCol=None, handleInvalid='error', stringOrderType='frequencyDesc'),\n Tokenizer(inputCol=None, outputCol=None),\n VectorAssembler(inputCols=None, outputCol=None),\n VectorIndexer(maxCategories=20, inputCol=None, outputCol=None, handleInvalid='error'),\n VectorSizeHint(inputCol=None, size=None, handleInvalid='error'),\n VectorSlicer(inputCol=None, outputCol=None, indices=None, names=None),\n Word2Vec(vectorSize=100, minCount=5, numPartitions=1, stepSize=0.025, maxIter=1, seed=None, inputCol=None, outputCol=None, windowSize=5, maxSentenceLength=1000)\n]\n```\n\nThese are not supported as it is not possible to specity the input column(s).\n\n```python\n[\n ChiSqSelector(numTopFeatures=50, featuresCol='features', outputCol=None, labelCol='label', selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05),\n LSHParams(),\n RFormula(formula=None, featuresCol='features', labelCol='label', forceIndexLabel=False, stringIndexerOrderType='frequencyDesc', handleInvalid='error'),\n SQLTransformer(statement=None)\n]\n```\n\n\n# Contributing\n\nIn order to contribute to the project, here is how you should configure your local development environment:\n\n- download the code and create a local virtual environment with the required dependencies:\n\n ```\n # getting the project\n $ git clone git@github.com:Quantmetry/pipeasy-spark.git\n $ cd pipeasy-spark\n # creating the virtual environnment and activating it\n $ make install_dev\n $ source .venv/bin/activate\n (.venv) $ make test\n # start the demo notebook\n (.venv) $ make notebook\n ```\n\n **Note:** the `make install` step installs the package in *editable* mode into the local virtual environment.\n This step is required as it also installs the package dependencies as they are listed in the `setup.py` file.\n\n- make sure that Java is correcly installed. [Download](https://www.java.com/en/download/mac_download.jsp)\n Java and install it. Then you should set the `JAVA_HOME` environment variable. For instance you can add the\n following line to your `~/.bash_profile`:\n\n ```\n # ~/.bash_profile -> this depends on your platform\n # note that the actual path might change for you\n export JAVA_HOME=\"/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/\"\n ```\n\n **Note:** notice that we did not install spark itself. Having a valid Java Runtime Engine installed and installing\n `pyspark` (done when installing the package's dependencies) is enough to run the tests.\n\n- run the tests of the project (you need to activate your local virtual environnment):\n\n ```\n $ source .venv/bin/activate\n # this runs the tests/ with the current python version\n (.venv) $ make test\n # check that the developped code follows the standards\n # (we use flake8 as a linting engine, it is configured in the tox.ini file)\n (.venv) $ make lint\n # run the tests for several python versions + run linting step\n (.venv) $ tox\n ```\n\n **Note:** the continuous integration process (run by TravisCI) performs the latest operation (`$ tox`). Consequently you should make sure that this step is successfull on your machine before pushing new code to the repository. However you might not have all python versions installed on your local machine; this is ok in most cases.\n\n- (optional) configure your text editor. Because the tests include a linting step, it is convenient to add this linting to your\n editor. For instance you can use VSCode with the Python extension and add linting with flake8 in the settings.\n It is a good idea to use as a python interpreter for linting (and code completetion etc.) the one in your local virtuel environnment.\n flake8 configuration options are specified in the `tox.ini` file.\n\n\n# Notes on setting up the project\n\n- I setup the project using [this cookiecutter project](https://cookiecutter-pypackage.readthedocs.io/en/latest/readme.html#features)\n- create a local virtual environnment and activate it\n- install requirements_dev.txt\n\n ```\n $ python3 -m venv .venv\n $ source .venv/bin./activate\n (.venv) $ pip install -r requirements_dev.txt\n ```\n\n- I update my vscode config to use this virtual env as the python interpreter for this project ([doc](https://code.visualstudio.com/docs/python/environments#_manually-specify-an-interpreter))\n (the modified file is /.vscode/settings.json)\n- I update the `Makefile` to be able to run tests (this way I don't have to mess with the `PYTHONPATH`):\n\n ```\n # avant\n test: ## run tests quickly with the default Python\n py.test\n\n # modification\n test: ## run tests quickly with the default Python\n python -m pytest tests/\n ```\n\n I can now run (when inside my local virtual environment):\n\n ```\n (.venv) $ make test\n ```\n\n I can also run `tox` which runs the tests agains several python versions:\n\n ```\n (.venv) $ tox\n py27: commands succeeded\n ERROR: py34: InterpreterNotFound: python3.4\n ERROR: py35: InterpreterNotFound: python3.5\n py36: commands succeeded\n flake8: commands succeeded\n ```\n\n- I log into Travis with my Github account. I can see and configure the builds for this repository (I have admin rights on the repo).\n I can trigger a build without pushing to the repository (More Options / Trigger Build). Everything runs fine!\n- I push this to a new branch : Travis triggers tests on this branch (even without creating a pull request).\n The tests fail because I changed `README.rst` to `README.md`. I need to also change this in `setup.py`.\n- I create an account on pypi.org and link it to the current project\n ([documentation](https://cookiecutter-pypackage.readthedocs.io/en/latest/travis_pypi_setup.html#travis-pypi-setup))\n\n ```\n $ brew install travis\n $ travis encrypt ****** --add deploy.password\n ```\n\n This modifies the `.travis.yml` file. I customize it slightly because the package name is wrong:\n\n ```\n # .travis.yml\n deploy:\n on:\n tags: true\n # the repo was not correct:\n repo: Quantmetry/pipeasy-spark\n ```\n\n- I update the version and push a tag:\n\n ```\n $ bumpversion patch\n $ git push --tags\n $ git push\n ```\n\n I can indeed see the tag (and an associated release) on the github interface. However Travis does not deploy\n on this commit. This is the intended behaviour. Be [default](https://docs.travis-ci.com/user/deployment/pypi/)\n travis deploys only on the `master` branch.\n\n- I setup an account on [readthedoc.io](https://readthedocs.org/). Selecting the repo is enough to have the documentation online! \n\n- When I merge to master Travis launches a build bus says it will not deploy\n (see [this build](https://travis-ci.org/Quantmetry/pipeasy-spark/jobs/440637481) for instance). However the library\n was indeed deployed to pypi: I can pip install it..\n\nNote: when Omar pushed new commits, travis does not report their status.\n\n- I update `setup.py` to add the dependence to `pyspark`. I also modify slightly the development setup (see Development section above).\n\n\n=======\nHistory\n=======\n\n0.2.0 (2019-01-06)\n------------------\n\nFirst usable version of the package. We decided on the api:\n\n- ``pipeasy_spark.build_pipeline(column_transformers={'column': []})`` is the core function \n where you can define a list of transormers for each columns.\n- ``pipeasy_spark.build_pipeline_by_dtypes(df, string_transformers=[])`` allows\n you to define a list of transormers for two types of columns: ``string_`` and ``numeric_``.\n- ``pipeasy_spark.build_default_pipeline(df, exclude_columns=['target'])`` builds a default\n transformer for the ``df`` dataframe.\n\n0.1.2 (2018-10-12)\n------------------\n\n* I am still learning how all these tools interact with each other\n\n0.1.1 (2018-10-12)\n------------------\n\n* First release on PyPI.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/BenjaminHabert/pipeasy_spark", "keywords": "pipeasy_spark", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "pipeasy-spark", "package_url": "https://pypi.org/project/pipeasy-spark/", "platform": "", "project_url": "https://pypi.org/project/pipeasy-spark/", "project_urls": { "Homepage": "https://github.com/BenjaminHabert/pipeasy_spark" }, "release_url": "https://pypi.org/project/pipeasy-spark/0.2.1/", "requires_dist": [ "pyspark", "numpy" ], "requires_python": "", "summary": "an easy way to define preprocessing data pipelines for pysparark", "version": "0.2.1" }, "last_serial": 4784734, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "ec6811080c197fd06ab50f1ec4adc38b", "sha256": "847cbe1a4f5a5e51c9a40ce125ed4922388fa94048f134adfe946581d9f778a0" }, "downloads": -1, "filename": "pipeasy_spark-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ec6811080c197fd06ab50f1ec4adc38b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 4006, "upload_time": "2018-10-12T12:14:43", "url": "https://files.pythonhosted.org/packages/f4/8f/594b6ca87c10577b92c535762cdfddc3c80e61dd454c82dc2cf06629d104/pipeasy_spark-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "837b419be2f13d80ebc006fb31050179", "sha256": "e57efa0c84690dce828b8a2ef6b64474ec372ca16b7d259cf6fc8b5368e0b1ea" }, "downloads": -1, "filename": "pipeasy_spark-0.1.1.tar.gz", "has_sig": false, "md5_digest": "837b419be2f13d80ebc006fb31050179", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9057, "upload_time": "2018-10-12T12:14:44", "url": "https://files.pythonhosted.org/packages/2b/6d/125560c59809891cb6467d6e177664669e70324243c97c73e1596d9c4b18/pipeasy_spark-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "ca5f984e35883b02e7738e600fb7e76e", "sha256": "806c64970d39be73c8fcea78201afcbd412c2cfc91328f497ac7b49e8c15bf6d" }, "downloads": -1, "filename": "pipeasy_spark-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ca5f984e35883b02e7738e600fb7e76e", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 4210, "upload_time": "2018-10-12T13:32:45", "url": "https://files.pythonhosted.org/packages/03/8d/302b8c0f8b85303af66014c9a34c34b0ae349a5a0c6c05ba87925176513f/pipeasy_spark-0.1.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a2dc270abc660e85e776e34ff630d140", "sha256": "a7790f05afcd1afe34540e74a707b00cabb46903a5a0a718d06b193f444bd871" }, "downloads": -1, "filename": "pipeasy_spark-0.1.2.tar.gz", "has_sig": false, "md5_digest": "a2dc270abc660e85e776e34ff630d140", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9280, "upload_time": "2018-10-12T13:32:46", "url": "https://files.pythonhosted.org/packages/0c/22/83748c24908b0ae3e25bc3e8de5ef9ad1217b1bd23fede781221b88810e7/pipeasy_spark-0.1.2.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "edc41f2a21fbf98b342cd0883c0ce1a1", "sha256": "22ffff88741330c8d9d479436002728c60e94380b2308ea4ceac41901700d402" }, "downloads": -1, "filename": "pipeasy_spark-0.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "edc41f2a21fbf98b342cd0883c0ce1a1", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 9971, "upload_time": "2019-02-06T02:48:40", "url": "https://files.pythonhosted.org/packages/84/72/cfeacc4ab5f50b3af8c9b5c5b11b043c4c156ecf48bf8db629dab0a65c11/pipeasy_spark-0.2.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "bafee6fd502ed42968a6b6bcd4d1431a", "sha256": "6fa7de7bc291a9a9405234be1df33719c7b91f3c45574806a82feea3c71350cd" }, "downloads": -1, "filename": "pipeasy_spark-0.2.0.tar.gz", "has_sig": false, "md5_digest": "bafee6fd502ed42968a6b6bcd4d1431a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24108, "upload_time": "2019-02-06T02:48:41", "url": "https://files.pythonhosted.org/packages/10/37/2b86dec45375c61e56209401020075bfa86cd15f34761e025e9ffe4ad254/pipeasy_spark-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "ee96a232c9bc519995ad99433bc84b8c", "sha256": "107a1af0be0265f1a51407fe9d583476d42ac48f7ac73671250304a61e9c6433" }, "downloads": -1, "filename": "pipeasy_spark-0.2.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ee96a232c9bc519995ad99433bc84b8c", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10191, "upload_time": "2019-02-06T02:58:20", "url": "https://files.pythonhosted.org/packages/c4/51/2d1fdecdf71e7c055036de13aa913ff22949f10bde9b7ed746b44e51595f/pipeasy_spark-0.2.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f846eb3594012799ba2f22b9fce599fd", "sha256": "04ef1c1e347f7046685495e2f41763057456226823dcf3b4bd8a87041d64e70e" }, "downloads": -1, "filename": "pipeasy_spark-0.2.1.tar.gz", "has_sig": false, "md5_digest": "f846eb3594012799ba2f22b9fce599fd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24821, "upload_time": "2019-02-06T02:58:21", "url": "https://files.pythonhosted.org/packages/ec/f8/8450de62c2b1118843857eb1482fe9c85cacc85d39f74ce92e9084c0c8f5/pipeasy_spark-0.2.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ee96a232c9bc519995ad99433bc84b8c", "sha256": "107a1af0be0265f1a51407fe9d583476d42ac48f7ac73671250304a61e9c6433" }, "downloads": -1, "filename": "pipeasy_spark-0.2.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ee96a232c9bc519995ad99433bc84b8c", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10191, "upload_time": "2019-02-06T02:58:20", "url": "https://files.pythonhosted.org/packages/c4/51/2d1fdecdf71e7c055036de13aa913ff22949f10bde9b7ed746b44e51595f/pipeasy_spark-0.2.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f846eb3594012799ba2f22b9fce599fd", "sha256": "04ef1c1e347f7046685495e2f41763057456226823dcf3b4bd8a87041d64e70e" }, "downloads": -1, "filename": "pipeasy_spark-0.2.1.tar.gz", "has_sig": false, "md5_digest": "f846eb3594012799ba2f22b9fce599fd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24821, "upload_time": "2019-02-06T02:58:21", "url": "https://files.pythonhosted.org/packages/ec/f8/8450de62c2b1118843857eb1482fe9c85cacc85d39f74ce92e9084c0c8f5/pipeasy_spark-0.2.1.tar.gz" } ] }