{ "info": { "author": "Ori Hoch", "author_email": "ori@uumpa.com", "bugtrack_url": null, "classifiers": [], "description": "# dataflows-serverless\n\nDataFlows-serverless allows to use the [DataFlows](https://github.com/datahq/dataflows) library using serverless concepts.\n\n* Runs on Kubernetes, save costs by starting and stopping the cluster when needed.\n* Supports parallel processing of suitable processing steps (e.g. steps which process each row independently).\n* Develop and test the processing flow locally using the standard DataFlows library.\n\n## Usage\n\n### Create a flow\n\nDevelop a flow locally using the [dataflows](https://github.com/datahq/dataflows) library.\n\nReplace the `Flow` class with `ServerlessFlow` and wrap steps relevant for serverless processing with `serverless_step`\n\nA simple example which downloads from a list of URLs:\n\n```\nfrom dataflows_serverless.flow import ServerlessFlow, serverless_step\nfrom dataflows import dump_to_path\nimport requests\n\nURLS = ['http://httpbin.org/get?page={}'.format(page) for page in range(100)]\n\ndef download(row):\n print('downloading {}'.format(url))\n row['content'] = str(requests.get(row['url']).json())\n\nServerlessFlow(({'url': url, 'content': ''} for url in URLS),\n serverless_step(download),\n dump_to_path('url_contents')).serverless().process()\n```\n\nSave it as `flow.py`, install the dataflows-serverless package and run the flow locally, without serverless:\n\n```\npip3 install dataflows-serverless\npython3 flow.py\n```\n\nThe output data is available at `./url_contents/res_1.csv`\n\n### Setup a Kubernetes cluster\n\nAny recent Kubernetes cluster should work, following are recommended methods to create a cluster\n\n###### Using Google Kubernetes Engine\n\nThis is the recommended method both for testing / development and for running production workloads\n\n* If you have problems running locally, try using [Google Cloud Shell](https://cloud.google.com/shell/) which makes the process even easier.\n* [Install Google Cloud SDK](https://cloud.google.com/sdk/docs/downloads-interactive)\n* Install kubectl:\n * `gcloud components install kubectl`\n* Login to Google cloud with application default credentials (ADC):\n * `gcloud auth application-default login`\n* Connect to an existing dataflows cluster or create a new one if it doesn't exist:\n * `$(dataflows_serverless_bin)/gke_connect_or_create.sh `\n* You can configure some additional arguments regarding the created cluster, run the `gke_connect_or_create`\n script without any parameters to see the available arguments.\n\n###### Using Minikube\n\nMinikube is a Kubernetes cluster which runs locally on a virtual machine:\n\n* Install [minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/)\n* Start the minikube cluster:\n * `minikube start`\n* Switch to the minikube context:\n * `kubectl config use-context minikube`\n* Create the dataflows namespace:\n * `kubectl create ns datafows`\n* Set the current context to use this namespace by default:\n * `kubectl config set-context minikube --namespace=dataflows`\n\n### Run the flow on the cluster\n\nThe following command starts 1 primary job which runs the main flow and 10 secondary jobs which run the serverless steps.\n\n```\npython3 flow.py --serverless --secondaries=10 --output-datadir=url_contents\n```\n\nThe output data is available locally at the same path as the output path of the non-serverless flow: `./url_contents/res_1.csv`\n\nFirst run can take a while due to pulling of Docker images, subsequent flows will run significantly faster.\n\nIf you are using Google Kubernetes Engine, don't forget to delete the cluster when done:\n\n```\n$(dataflows_serverless_bin)/gke_cleanup.sh \n```\n\n## Advanced Usage\n\n### Installing requirements / system dependencies\n\nTo install additional requirements for your flow you need to create a Docker image from the dataflows image\n\nGet the relevant docker image by running `dataflows_serverless_image`\n\nThe following example Dockerfile adds the Python imaging library:\n\n```\nFROM orihoch/dataflows-serverless:9\nRUN apk add --update --no-cache zlib-dev jpeg-dev && pip3 install pillow\n```\n\nBuild and push the modified image. Kubernetes doesn't pull the image if it already exists, so make sure to modify the tag on each image you build.\n\nRun the flow using your modified image\n\n```\npython3 flow.py --serverless --secondaries=10 --output-datadir=url_contents --image=your-username/dataflows-serverless-pillow:0.0.1\n```\n\n### Providing input data\n\nYou can provide input data which will be copied before the jobs are started, all input data should be in a relative path to current working directory\n\n```\npython3 flow.py --serverless --secondaries=10 --output-datadir=url_contents --input-datadir=data/input_dir_1 --input-datadir=data/input_dir_2\n```\n\n### Advanced data initialization\n\nTo provide large data and more advanced data initialization you can provide a data init container.\n\nThe following example shows how to use Google Cloud Storage.\n\nCreate a data initialization Docker image which copies data to the `/exports/data` directory:\n\n```\nFROM google/cloud-sdk\nENTRYPOINT [\"bash\", \"-c\", \"\\\n mkdir -p /exports/data/ &&\\\n gcloud --project=GOOGLE_PROJECT_ID auth activate-service-account --key-file=/secrets/service-account.json &&\\\n gsutil -m cp -r gs://BUCKET_NAME/data/ /exports/\n\"]\n```\n\nThe image needs credentials to access Google Storage, the following script creates a service account for authentication to Google Cloud,\nrelated service account key and Kubernetes secret\n\n```\nGOOGLE_PROJECT_ID=\nSERVICE_ACCOUNT_NAME=\nSECRET_NAME=\ngcloud --project=$GOOGLE_PROJECT_ID iam service-accounts create $SERVICE_ACCOUNT_NAME &&\\\ngcloud projects add-iam-policy-binding $GOOGLE_PROJECT_ID \\\n --member \"serviceAccount:${SERVICE_ACCOUNT_NAME}@${GOOGLE_PROJECT_ID}.iam.gserviceaccount.com\" \\\n --role \"roles/storage.objectAdmin\" &&\\\ngcloud iam service-accounts keys create secret-service-account.json --iam-account ${SERVICE_ACCOUNT_NAME}@${GOOGLE_PROJECT_ID}.iam.gserviceaccount.com &&\\\nkubectl create secret generic ${SECRET_NAME} --from-file=service-account.json=secret-service-account.json\n```\n\nYou will probably need to modify your flow to get the input data from the right directory.\n\nYou can check for DATAFLOWS_WORKDIR environment variable to conditionally use the right path on server and locally\n\n```\ndata_dir = os.path.join(os.environ.get('DATAFLOWS_WORKDIR', '.'), 'data')\n```\n\nRun the serverless flow:\n\n```\npython3 flow.py --serverless --secondaries=10 --output-datadir=url_contents \\\n --data-init-image=your-username/data-init-image --data-init-secret=${SECRET_NAME}\n```\n\nTo prevent reloading of data you can keep the data server running, provide a unique value for the --nfs-uuid= argument:\n\n```\npython3 flow.py --serverless --secondaries=10 --output-datadir=url_contents \\\n --data-init-image=your-username/data-init --data-init-secret=${SECRET_NAME} \\\n --nfs-uuid=my-nfs\n```\n\nPay attention that in this case, you have to make sure a single primary job is running on each data server.\n\n### Prevent cleanup\n\nPrevent cleanup of created resources - allowing to debug\n\n```\npython3 flow.py --serverless --secondaries=10 --output-datadir=url_contents --no-cleanup\n```\n\nDoing cleanup can be problematic due to the dynamic nature and usage of NFS, however, the following commands should do it:\n\n```\nkubectl delete jobs --all && kubectl delete all --all\n```\n\n### Debugging flows remotely\n\nStart the serverless flow in debug mode\n\n```\npython3 flow.py --serverless --secondaries=2 --output-datadir=url_contents --debug\n```\n\nYou will now need to manually run the jobs by executing on the relevant pod, for example:\n\n```\nkubectl exec -it PRIMARY_POD_NAME /entrypoint.sh\nkubectl exec -it SECONDARY_1_POD_NAME /entrypoint.sh\nkubectl exec -it SECONDARY_2_POD_NAME /entrypoint.sh\n```\n\n## Updating and publishing a dataflows-serverless release\n\nUpdate the version in `VERSION.txt`\n\nUpdate the `DEFAULT_IMAGE` constant in `dataflows_serverless/constants.py` to the new Docker image tag.\n\nBuild and publish the Docker image: `docker build -t orihoch/dataflows-serverless:v$(cat VERSION.txt) . && docker push orihoch/dataflows-serverless:v$(cat VERSION.txt)`\n\nBuild and publish the package on PyPI: `python setup.py sdist && twine upload dist/dataflows_serverless-$(cat VERSION.txt).tar.gz`", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/OriHoch/dataflows-serverless", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "dataflows-serverless", "package_url": "https://pypi.org/project/dataflows-serverless/", "platform": "", "project_url": "https://pypi.org/project/dataflows-serverless/", "project_urls": { "Homepage": "https://github.com/OriHoch/dataflows-serverless" }, "release_url": "https://pypi.org/project/dataflows-serverless/0.0.2/", "requires_dist": null, "requires_python": "", "summary": "Use the DataFlows library using serverless concepts.", "version": "0.0.2" }, "last_serial": 4301228, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "cbef46f020742a00dbe7a580719b9676", "sha256": "935e2414080121c4c9bf57965dafa17110b005f33492bed9d6d1cdfae41b0104" }, "downloads": -1, "filename": "dataflows_serverless-0.0.1.tar.gz", "has_sig": false, "md5_digest": "cbef46f020742a00dbe7a580719b9676", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14860, "upload_time": "2018-09-23T08:17:02", "url": "https://files.pythonhosted.org/packages/75/fd/6980836201559a9e3e6e7b4c18626ed26381a8006e8fb6f81935c180a9cb/dataflows_serverless-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "0affb1798043957926b09cdf37919b2b", "sha256": "2919ba4bd9f0f01c39969b1e4b9856e9356410ae819dc99804b59442122a17ef" }, "downloads": -1, "filename": "dataflows_serverless-0.0.2.tar.gz", "has_sig": false, "md5_digest": "0affb1798043957926b09cdf37919b2b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14930, "upload_time": "2018-09-23T08:24:05", "url": "https://files.pythonhosted.org/packages/78/34/4a3fb1402a8e5fa0f4a491985c3123d5dbbcf9f61eb89e6f912ada94caf3/dataflows_serverless-0.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0affb1798043957926b09cdf37919b2b", "sha256": "2919ba4bd9f0f01c39969b1e4b9856e9356410ae819dc99804b59442122a17ef" }, "downloads": -1, "filename": "dataflows_serverless-0.0.2.tar.gz", "has_sig": false, "md5_digest": "0affb1798043957926b09cdf37919b2b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14930, "upload_time": "2018-09-23T08:24:05", "url": "https://files.pythonhosted.org/packages/78/34/4a3fb1402a8e5fa0f4a491985c3123d5dbbcf9f61eb89e6f912ada94caf3/dataflows_serverless-0.0.2.tar.gz" } ] }