{ "info": { "author": "Derek Miller", "author_email": "dmmiller612@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# SparkFlow\n\nThis is an implementation of TensorFlow on Spark. The goal of this library is to provide a simple, understandable interface \nin using TensorFlow on Spark. With SparkFlow, you can easily integrate your deep learning model with a ML Spark Pipeline.\nUnderneath, SparkFlow uses a parameter server to train the TensorFlow network in a distributed manner. Through the api,\nthe user can specify the style of training, whether that is Hogwild or async with locking.\n\n[![Build Status](https://api.travis-ci.org/lifeomic/sparkflow.svg?branch=master)](https://travis-ci.org/lifeomic/sparkflow)\n[![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/lifeomic/sparkflow/blob/master/LICENSE)\n\n## Why should I use this?\nWhile there are other libraries that use TensorFlow on Apache Spark, SparkFlow's objective is to work seamlessly \nwith ML Pipelines, provide a simple interface for training TensorFlow graphs, and give basic abstractions for \nfaster development. For training, SparkFlow uses a parameter server which lives on the driver and allows for asynchronous training. This tool \nprovides faster training time when using big data.\n\n## Installation\n\nInstall SparkFlow via pip: `pip install sparkflow`\n\nSparkFlow requires Apache Spark >= 2.0, flask, dill, and TensorFlow to be installed. 
As of sparkflow >= 0.7.0, only \npython >= 3.5 will be supported.\n\n\n## Example\n\n#### Simple MNIST Deep Learning Example\n\n```python\nfrom sparkflow.graph_utils import build_graph\nfrom sparkflow.tensorflow_async import SparkAsyncDL\nimport tensorflow as tf\nfrom pyspark.ml.feature import VectorAssembler, OneHotEncoder\nfrom pyspark.ml.pipeline import Pipeline\n \n#simple tensorflow network\ndef small_model():\n x = tf.placeholder(tf.float32, shape=[None, 784], name='x')\n y = tf.placeholder(tf.float32, shape=[None, 10], name='y')\n layer1 = tf.layers.dense(x, 256, activation=tf.nn.relu)\n layer2 = tf.layers.dense(layer1, 256, activation=tf.nn.relu)\n out = tf.layers.dense(layer2, 10)\n z = tf.argmax(out, 1, name='out')\n loss = tf.losses.softmax_cross_entropy(y, out)\n return loss\n \ndf = spark.read.option(\"inferSchema\", \"true\").csv('mnist_train.csv')\nmg = build_graph(small_model)\n#Assemble and one hot encode\nva = VectorAssembler(inputCols=df.columns[1:785], outputCol='features')\nencoded = OneHotEncoder(inputCol='_c0', outputCol='labels', dropLast=False)\n\nspark_model = SparkAsyncDL(\n inputCol='features',\n tensorflowGraph=mg,\n tfInput='x:0',\n tfLabel='y:0',\n tfOutput='out:0',\n tfLearningRate=.001,\n iters=20,\n predictionCol='predicted',\n labelCol='labels',\n verbose=1\n)\n\np = Pipeline(stages=[va, encoded, spark_model]).fit(df)\np.write().overwrite().save(\"location\")\n``` \n\nFor a couple more, visit the examples directory. These examples can be run with Docker as well from the provided Dockerfile and \nMakefile. This can be done with the following command:\n\n```bash\nmake docker-build\nmake docker-run-dnn\n```\n\nOnce built, there are also commands to run the example CNN and an autoencoder.\n\n\n## Documentation\n\n#### Saving and Loading Pipelines\n\nSince saving and loading custom ML Transformers in pure python has not been implemented in PySpark, an extension has been\nadded here to make that possible. 
To save a PySpark Pipeline with Apache Spark, use the overwrite function:\n\n```python\np = Pipeline(stages=[va, encoded, spark_model]).fit(df)\np.write().overwrite().save(\"location\")\n```\n\nFor loading, a Pipeline wrapper is provided in the pipeline_util module. An example is below:\n\n```python\nfrom sparkflow.pipeline_util import PysparkPipelineWrapper\nfrom pyspark.ml.pipeline import PipelineModel\n\np = PysparkPipelineWrapper.unwrap(PipelineModel.load('location'))\n``` \nYou can then run predictions with:\n\n```python\npredictions = p.transform(df)\n```\n\n#### Serializing Tensorflow Graph for SparkAsyncDL\n\nYou may have already noticed the build_graph function in the example above. This serializes the Tensorflow graph for training on Spark.\nThe build_graph function takes a single parameter: a function that defines the Tensorflow variables and returns the loss.\nBelow is an example Tensorflow graph function:\n\n```python\nimport tensorflow as tf\n\ndef small_model():\n x = tf.placeholder(tf.float32, shape=[None, 784], name='x')\n y = tf.placeholder(tf.float32, shape=[None, 10], name='y')\n layer1 = tf.layers.dense(x, 256, activation=tf.nn.relu)\n layer2 = tf.layers.dense(layer1, 256, activation=tf.nn.relu)\n out = tf.layers.dense(layer2, 10)\n z = tf.argmax(out, 1, name='out')\n loss = tf.losses.softmax_cross_entropy(y, out)\n return loss\n```\n\nThen to use the build_graph function:\n\n```python\nfrom sparkflow.graph_utils import build_graph\nmg = build_graph(small_model)\n```\n\n#### Using SparkAsyncDL and Options\n\nSparkAsyncDL has a few options that one can use for training. Not all of the parameters are required. Below is a description \nof each of the parameters:\n\n```\ninputCol: Spark dataframe inputCol. Similar to other spark ml inputCols\ntensorflowGraph: The protobuf tensorflow graph. You can use the utility function in graph_utils to generate the graph for you\ntfInput: The tensorflow input. 
This is the name of the input variable that you would like to use for training\ntfLabel: The tensorflow label. This is the variable name for the label.\ntfOutput: The tensorflow raw output. This is for your loss function.\ntfOptimizer: The optimization function you would like to use for training. Defaults to adam\ntfLearningRate: Learning rate of the optimization function\niters: Number of iterations of training\npredictionCol: The prediction column name on the spark dataframe for transformations\npartitions: Number of partitions to use for training (recommended: one partition per instance)\nminiBatchSize: Size of the mini batch. A size of -1 means train on all rows\nminiStochasticIters: If using a mini batch, the number of mini-batch iterations to run per epoch with the batch size above. A value of -1 means that you would like to run mini-batches on all data in the partition\nacquireLock: If you do not want to utilize hogwild training, this will set a lock\nshufflePerIter: Specifies if you want to shuffle the features after each iteration\ntfDropout: Specifies the dropout variable. This is important for predictions\ntoKeepDropout: Due to conflicting TF implementations, this specifies whether the dropout function means to keep a percentage of values or to drop a percentage of values.\nverbose: Specifies log level of training results\nlabelCol: Label column for training\npartitionShuffles: This will shuffle your data after iterations are completed, then run again. For example,\nif you have 2 partition shuffles and 100 iterations, it will run 100 iterations then reshuffle and run 100 iterations again.\nThe repartition hits performance and should be used with care.\noptimizerOptions: JSON options to apply to tensorflow optimizers.\n```\n\n#### Optimization Configuration\n\nAs of SparkFlow version 0.2.1, TensorFlow optimization configuration options can be added to SparkAsyncDL for more control \nover the optimizer. 
While the user can supply the configuration JSON directly, there are a few provided utility \nfunctions that include the parameters necessary. An example is provided below.\n\n```python\n\nfrom sparkflow.graph_utils import build_adam_config\n\n\nadam_config = build_adam_config(learning_rate=0.001, beta1=0.9, beta2=0.999)\nspark_model = SparkAsyncDL(\n ...,\n optimizerOptions=adam_config\n)\n```\n\n#### Loading pre-trained Tensorflow model\n\nTo load a pre-trained Tensorflow model and use it in a Spark pipeline, use the following code:\n\n```python\nfrom sparkflow.tensorflow_model_loader import load_tensorflow_model\n\ndf = spark.read.parquet(\"data\")\nloaded_model = load_tensorflow_model(\n path=\"./test_model/to_load\",\n inputCol=\"features\",\n tfInput=\"x:0\",\n tfOutput=\"out/Sigmoid:0\"\n)\ndata_with_predictions = loaded_model.transform(df)\n```\n\n\n## Running\n\nOne big thing to remember, especially for larger networks, is to add the `--executor-cores 1` option to spark to ensure\neach instance is only training one copy of the network. This is especially important for GPU training.\n\n\n## Contributing\n\nContributions are always welcome. This could be fixing a bug, changing documentation, or adding a new feature. To test \nnew changes against existing tests, we have provided a Docker container which takes in an argument of the python version. 
\nThis allows the user to check their work before pushing to Github, where travis-ci will run.\n\nFor 2.7 (sparkflow <= 0.6.0):\n```\ndocker build -t local-test --build-arg PYTHON_VERSION=2.7 .\ndocker run --rm local-test:latest bash -i -c \"python tests/dl_runner.py\"\n```\n\nFor 3.6\n```\ndocker build -t local-test --build-arg PYTHON_VERSION=3.6 .\ndocker run --rm local-test:latest bash -i -c \"python tests/dl_runner.py\"\n```\n\n\n## Future planned features \n\n* Hyperopt implementation for smaller and larger datasets\n* AWS EMR guides\n\n\n## Literature and Inspiration\n\n* HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent: https://arxiv.org/pdf/1106.5730.pdf\n* Elephas: https://github.com/maxpumperla/elephas\n* Scaling Distributed Machine Learning with the Parameter Server: https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/lifeomic/sparkflow/archive/0.7.0.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/lifeomic/sparkflow", "keywords": "tensorflow,spark,sparkflow,machine learning,lifeomic,deep learning", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "sparkflow", "package_url": "https://pypi.org/project/sparkflow/", "platform": "", "project_url": "https://pypi.org/project/sparkflow/", "project_urls": { "Download": "https://github.com/lifeomic/sparkflow/archive/0.7.0.tar.gz", "Homepage": "https://github.com/lifeomic/sparkflow" }, "release_url": "https://pypi.org/project/sparkflow/0.7.0/", "requires_dist": null, "requires_python": "", "summary": "Deep learning on Spark with Tensorflow", "version": "0.7.0" }, "last_serial": 5285694, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "c0345e1aaa6c8374efffe6ced1e516f9", "sha256": "323d58e16ba0def84d23daedb1f093e38349850903e76bb44a8718c91a6bff71" }, "downloads": -1, "filename": 
"sparkflow-0.1.tar.gz", "has_sig": false, "md5_digest": "c0345e1aaa6c8374efffe6ced1e516f9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10543, "upload_time": "2018-04-07T16:48:32", "url": "https://files.pythonhosted.org/packages/fd/73/88c0ce09c22afcfaba711e2d22e904c8add2f85187b3a5e071d5c1eb20ff/sparkflow-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "51a9f76f57e4690a1014a593c327dd1a", "sha256": "394123877c5c5920094bccfc8bffb6cd266e94819dd83fc11d4ab69ba70d3a35" }, "downloads": -1, "filename": "sparkflow-0.1.1.tar.gz", "has_sig": false, "md5_digest": "51a9f76f57e4690a1014a593c327dd1a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12019, "upload_time": "2018-04-08T01:06:55", "url": "https://files.pythonhosted.org/packages/0d/4a/b0044eca167fa294443723954d3db33cc921df04e95268a6756628c8d0f3/sparkflow-0.1.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "9a5e8c72fbfe1c32a78cac2ec7a1a946", "sha256": "531572babcedcb2bff8e445b6e4c69e63cfc561055ce7533418725c83b030a66" }, "downloads": -1, "filename": "sparkflow-0.2.tar.gz", "has_sig": false, "md5_digest": "9a5e8c72fbfe1c32a78cac2ec7a1a946", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10312, "upload_time": "2018-04-08T21:55:37", "url": "https://files.pythonhosted.org/packages/50/6f/92d152960e428930bd0ff42b07fce966fadce8b061aeb53c3019379981ac/sparkflow-0.2.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "d8187da1ae2da1fb9a14273ed156f659", "sha256": "15d4f6928df0eeaed9819d2e1a36749c003a21b15ca4ffc5c829d1fb708d75f6" }, "downloads": -1, "filename": "sparkflow-0.2.1.tar.gz", "has_sig": false, "md5_digest": "d8187da1ae2da1fb9a14273ed156f659", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13275, "upload_time": "2018-05-01T22:42:57", "url": 
"https://files.pythonhosted.org/packages/49/94/ab07c7f2892e188fb1a9d3907d96f0036fdfef61b3afaa342f4ce7b64dc8/sparkflow-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "4cfd6c9f7d457392f877fd40f46c8903", "sha256": "87cb347bcf966f22e1a1e92a70c3ed11030c6d5c70be843f20b5a71b5799e443" }, "downloads": -1, "filename": "sparkflow-0.2.2.tar.gz", "has_sig": false, "md5_digest": "4cfd6c9f7d457392f877fd40f46c8903", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11325, "upload_time": "2018-05-05T14:26:06", "url": "https://files.pythonhosted.org/packages/5c/68/92810d00c7db5b0446508e9a37de8d6595db83f9f4da431f95e4a603bf8d/sparkflow-0.2.2.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "e82f18cc8cbf446c2001bcff15fba56b", "sha256": "9406f719d76792149fee1d43ddfca427a9b2d1c414ed27b4b3ea3a02970cddd6" }, "downloads": -1, "filename": "sparkflow-0.3.0.tar.gz", "has_sig": false, "md5_digest": "e82f18cc8cbf446c2001bcff15fba56b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14138, "upload_time": "2018-07-14T18:46:02", "url": "https://files.pythonhosted.org/packages/80/78/e79888e69cea5f6a264133342e633cad9cda2db8580c054c1eff9850b757/sparkflow-0.3.0.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "0f31648363950174bd4ff691cd7ff4e5", "sha256": "eda33d132334e0c22274121ac54a6e38395bf968986e393f5d430bf6025d16a8" }, "downloads": -1, "filename": "sparkflow-0.3.1.tar.gz", "has_sig": false, "md5_digest": "0f31648363950174bd4ff691cd7ff4e5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14140, "upload_time": "2018-07-14T18:54:45", "url": "https://files.pythonhosted.org/packages/4f/6b/8e559f37137781a4d1b1446143a5a7aedeeb5a98ed5d8f9a2dd5cfe66a9d/sparkflow-0.3.1.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "3d8260183152c2426ae305b57308ca19", "sha256": 
"18aeabbe72940efc8aabf908fad4f13755c0f7b3805572cc3082ad646413c55c" }, "downloads": -1, "filename": "sparkflow-0.4.0.tar.gz", "has_sig": false, "md5_digest": "3d8260183152c2426ae305b57308ca19", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14271, "upload_time": "2018-09-14T12:29:11", "url": "https://files.pythonhosted.org/packages/ec/8c/831400450676d8e21182d944d515f06dc3e1ce50295e11ee46c2409b0d1a/sparkflow-0.4.0.tar.gz" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "4b17d81a8cfc2024b43326d059153889", "sha256": "59015e6a9518fd08346df70099934e4b2217411c1461879073348a6099c8b4c9" }, "downloads": -1, "filename": "sparkflow-0.4.1-py2-none-any.whl", "has_sig": false, "md5_digest": "4b17d81a8cfc2024b43326d059153889", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 17090, "upload_time": "2018-09-14T17:00:11", "url": "https://files.pythonhosted.org/packages/89/96/095048703181522043587a80c6e53ae32cbf41d31078b902f04ed50b2bf2/sparkflow-0.4.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f7d739e5f55c7c31629c2061981d0842", "sha256": "2c0cdb43cd3303e5cda627ff35a833e332f57356adb21d88626f96524b0b5be1" }, "downloads": -1, "filename": "sparkflow-0.4.1.tar.gz", "has_sig": false, "md5_digest": "f7d739e5f55c7c31629c2061981d0842", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14446, "upload_time": "2018-09-14T17:00:12", "url": "https://files.pythonhosted.org/packages/17/83/a4eb1e5d3eb2a03c0387a283705a2d3bd5cccde18cb690873db2ecf1e224/sparkflow-0.4.1.tar.gz" } ], "0.4.2": [ { "comment_text": "", "digests": { "md5": "535241273ca97dcd2985d8aaeb5f2dfe", "sha256": "27231eb45b3148f1e4e1b744d2c0d5328e13b01c77a68790365e918c57d48da7" }, "downloads": -1, "filename": "sparkflow-0.4.2-py2-none-any.whl", "has_sig": false, "md5_digest": "535241273ca97dcd2985d8aaeb5f2dfe", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 
17087, "upload_time": "2018-09-14T19:17:45", "url": "https://files.pythonhosted.org/packages/42/54/72d17228f41c9e30b1794b021bd84a749a5c03fd94d480f598c8092ceefa/sparkflow-0.4.2-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cee2e1ae1ee3397d1e6ef90233676ca7", "sha256": "b16c3cf8e862fc7ea864d141975e33b14230373db02d9c60f6427d52280e308d" }, "downloads": -1, "filename": "sparkflow-0.4.2.tar.gz", "has_sig": false, "md5_digest": "cee2e1ae1ee3397d1e6ef90233676ca7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14456, "upload_time": "2018-09-14T19:17:46", "url": "https://files.pythonhosted.org/packages/d8/e7/5367d8b5dff6b1f9629364fb48b2cec878c343ca33d9cd5b868f7f2bd7d0/sparkflow-0.4.2.tar.gz" } ], "0.5.0": [ { "comment_text": "", "digests": { "md5": "b9123818ea8969735c019bc8c2c2eecd", "sha256": "29780bfc17795cfe15a1728b2827ea45d9696738bdd41df2072351129ff4f3e3" }, "downloads": -1, "filename": "sparkflow-0.5.0.tar.gz", "has_sig": false, "md5_digest": "b9123818ea8969735c019bc8c2c2eecd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14734, "upload_time": "2018-11-29T02:03:25", "url": "https://files.pythonhosted.org/packages/8e/8d/0aa4821eab14d4596ec6b3702663db3dea3d03bca064b0c2dc91edb57d7b/sparkflow-0.5.0.tar.gz" } ], "0.6.0": [ { "comment_text": "", "digests": { "md5": "d59ee1cfa4f225906de97c4222cb64c7", "sha256": "353748a7bce84668a8503b8af518c53e4f10eb5ba6a8b7def2ef6eb9e306c422" }, "downloads": -1, "filename": "sparkflow-0.6.0.tar.gz", "has_sig": false, "md5_digest": "d59ee1cfa4f225906de97c4222cb64c7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14938, "upload_time": "2019-01-09T01:44:04", "url": "https://files.pythonhosted.org/packages/99/1c/f5cebccb72a41b52cbd02f0ac01f4e267a93e0db90d8f828537c19d1b069/sparkflow-0.6.0.tar.gz" } ], "0.7.0": [ { "comment_text": "", "digests": { "md5": "7ae637a1ec337b7321328c1d694bc5f6", "sha256": 
"532e5055f6d21333460b5162dfddbe8d53408eaca95d89d0654d6a95e6b2cb57" }, "downloads": -1, "filename": "sparkflow-0.7.0.tar.gz", "has_sig": false, "md5_digest": "7ae637a1ec337b7321328c1d694bc5f6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16732, "upload_time": "2019-05-18T12:55:41", "url": "https://files.pythonhosted.org/packages/12/20/8af389a47d5d0b5d21998aa2e6fd7f3e4358d3e095b8e6ae2a65063f5ace/sparkflow-0.7.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7ae637a1ec337b7321328c1d694bc5f6", "sha256": "532e5055f6d21333460b5162dfddbe8d53408eaca95d89d0654d6a95e6b2cb57" }, "downloads": -1, "filename": "sparkflow-0.7.0.tar.gz", "has_sig": false, "md5_digest": "7ae637a1ec337b7321328c1d694bc5f6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16732, "upload_time": "2019-05-18T12:55:41", "url": "https://files.pythonhosted.org/packages/12/20/8af389a47d5d0b5d21998aa2e6fd7f3e4358d3e095b8e6ae2a65063f5ace/sparkflow-0.7.0.tar.gz" } ] }