{ "info": { "author": "", "author_email": "", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Software Development :: Libraries" ], "description": "# tf-yarn\u1d5d\n\n![tf-yarn](https://github.com/criteo/tf-yarn/blob/master/skein.png?raw=true)\n\n## Installation\n\n### Install with Pip\n\n```bash\n$ pip install tf-yarn\n```\n\n### Install from source\n\n```bash\n$ git clone https://github.com/criteo/tf-yarn\n$ cd tf-yarn\n$ pip install .\n```\n\n### Prerequisites\n\ntf-yarn only supports Python \u22653.6.\n\nMake sure to have Tensorflow working with HDFS by setting up all the environment variables as described [here](https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/hadoop.md).\n\nYou can run the `check_hadoop_env` script to check that your setup is OK (it has been installed by tf_yarn):\n\n```\n$ check_hadoop_env\n# You should see something like\n# INFO:tf_yarn.bin.check_hadoop_env:results will be written in /home/.../shared/Dev/tf-yarn/check_hadoop_env.log\n# INFO:tf_yarn.bin.check_hadoop_env:check_env: True\n# INFO:tf_yarn.bin.check_hadoop_env:write dummy file to hdfs hdfs://root/tmp/a1df7b99-fa47-4a86-b5f3-9bc09019190f/hello_tf_yarn.txt\n# INFO:tf_yarn.bin.check_hadoop_env:check_local_hadoop_tensorflow: True\n# INFO:root:Launching remote check\n# ...\n# INFO:tf_yarn.bin.check_hadoop_env:remote_check: True\n# INFO:tf_yarn.bin.check_hadoop_env:Hadoop setup: OK\n```\n\n## Quickstart\n\nThe core abstraction in tf-yarn is called an `ExperimentFn`. 
It is\na function returning a triple of an `Estimator` and two specs:\n`TrainSpec` and `EvalSpec`.\n\nHere is a stripped-down `experiment_fn` from\n[`examples/linear_classifier_example.py`][linear_classifier_example]\nto give you an idea of how it might look:\n\n```python\nfrom tf_yarn import Experiment\n\ndef experiment_fn():\n    # ...\n    estimator = tf.estimator.LinearClassifier(...)\n    return Experiment(\n        estimator,\n        tf.estimator.TrainSpec(train_input_fn),\n        tf.estimator.EvalSpec(eval_input_fn))\n```\n\nAn experiment can be scheduled on YARN using the `run_on_yarn` function, which\ntakes three required arguments: Python environment(s), `experiment_fn`,\nand a dictionary specifying how many resources to allocate for each of the\ndistributed TensorFlow task types. The example uses the [Wine Quality][wine-quality]\ndataset from the UCI ML repository. With just under 5000 training instances available,\nthere is no need for multi-node training, meaning that a `\"chief\"` task complemented by an\n`\"evaluator\"` would manage just fine. 
Note that each task will be executed\nin its own YARN container.\n\n```python\nfrom tf_yarn import TaskSpec, run_on_yarn\nfrom tf_yarn import packaging\n\npyenv_zip_path, _ = packaging.upload_env_to_hdfs()\nrun_on_yarn(\n    pyenv_zip_path,\n    experiment_fn,\n    task_specs={\n        \"chief\": TaskSpec(memory=2 * 2**10, vcores=4),\n        \"evaluator\": TaskSpec(memory=2**10, vcores=1),\n        \"tensorboard\": TaskSpec(memory=2**10, vcores=1)\n    }\n)\n```\n\nThe final bit is to forward the `winequality.py` module to the YARN containers,\nin order for the tasks to be able to import it:\n\n```python\nrun_on_yarn(\n    ...,\n    files={\n        os.path.basename(winequality.__file__): winequality.__file__,\n    }\n)\n```\n\n[linear_classifier_example]: https://github.com/criteo/tf-yarn/blob/master/examples/linear_classifier_example.py\n[wine-quality]: https://archive.ics.uci.edu/ml/datasets/Wine+Quality\n\n## Distributed TensorFlow 101\n\nThe following is a brief summary of the core distributed TensorFlow\nconcepts relevant to [training estimators][train-and-evaluate]. Please refer\nto the [official documentation][distributed-tf] for the full version.\n\nDistributed TensorFlow operates in terms of tasks. A task has a type which\ndefines its purpose in the distributed TensorFlow cluster. ``\"worker\"`` tasks\nheaded by the `\"chief\"` worker do model training. The `\"chief\"` additionally\nhandles checkpointing, saving/restoring the model, etc. The model itself is\nstored on one or more `\"ps\"` tasks. These tasks typically do not compute\nanything. Their sole purpose is serving the variables of the model. Finally,\nthe `\"evaluator\"` task is responsible for periodically evaluating the model.\n\nAt a minimum, a cluster must have a single `\"chief\"` task. 
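To make the task-type layout above concrete, here is a sketch of the `TF_CONFIG` cluster spec through which distributed TensorFlow tasks discover each other; tf-yarn assembles an equivalent spec from the YARN container addresses, and the hostnames and ports below are purely illustrative:

```python
import json

# Illustrative cluster spec for the task types described above:
# one "chief", two "worker" tasks and one "ps" task.
# Hostnames and ports are made up for the example.
tf_config = {
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
        "ps": ["host3:2222"],
    },
    # Identity of the task that reads this spec.
    "task": {"type": "worker", "index": 0},
}
print(json.dumps(tf_config))
```

Each task reads this spec from its environment to learn its own role and the addresses of its peers.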
However, it\nis a good idea to complement it with an `\"evaluator\"` to allow for running\nthe evaluation in parallel with the training.\n\n```\n+-----------+          +---------+   +----------+   +----------+\n| evaluator |    +-----+ chief:0 |   | worker:0 |   | worker:1 |\n+-----+-----+    |     +----^----+   +-----^----+   +-----^----+\n      ^          |          |              |              |\n      |          v          |              |              |\n      |    +-----+---+      |              |              |\n      |    |  model  |   +--v---+          |              |\n      +----+ exports |   | ps:0 <----------+--------------+\n           +---------+   +------+\n```\n\n[distributed-tf]: https://www.tensorflow.org/deploy/distributed\n[train-and-evaluate]: https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate\n\n## Training with multiple workers\n\nMulti-worker clusters require at least one parameter server (`\"ps\"`) task\nto store the variables being updated by the `\"chief\"` and `\"worker\"` tasks. It is\ngenerally a good idea to give `\"ps\"` tasks more than one vcore to allow for concurrent I/O\nprocessing.\n\n```python\nrun_on_yarn(\n    ...,\n    task_specs={\n        \"chief\": TaskSpec(memory=2 * 2**10, vcores=4),\n        \"worker\": TaskSpec(memory=2 * 2**10, vcores=4, instances=8),\n        \"ps\": TaskSpec(memory=2 * 2**10, vcores=8),\n        \"evaluator\": TaskSpec(memory=2**10, vcores=1),\n        \"tensorboard\": TaskSpec(memory=2**10, vcores=1)\n    }\n)\n```\n\n## Configuring the Python interpreter and packages\n\ntf-yarn needs to ship an isolated virtual environment to the containers. 
\n\nYou can use the `packaging` module to generate a package on HDFS from your currently\ninstalled virtual environment (install the dependencies from `requirements.txt` first:\n`pip install -r requirements.txt`).\nThis works with both conda environments and virtual environments.\n\nBy default, the generated package is a [pex][pex] package.\n\n```python\npyenv_zip_path, env_name = packaging.upload_env_to_hdfs()\nrun_on_yarn(\n    pyenv_zip_path=pyenv_zip_path\n)\n```\n\nIf you pass `packaging.CONDA_PACKER` to `upload_env_to_hdfs`, it will use\n[conda-pack][conda-pack] to create the package instead.\n\nYou can also directly use the command line tools provided by [conda-pack][conda-pack] and [pex][pex].\n\nFor pex, you can run this command in the root directory to create the package\n(it includes all requirements from setup.py):\n\n```\npex . -o myarchive.pex\n```\n\nYou can then run tf-yarn with your generated package:\n\n```python\nrun_on_yarn(\n    pyenv_zip_path=\"myarchive.pex\"\n)\n```\n\n[conda-pack]: https://conda.github.io/conda-pack/\n[pex]: https://pex.readthedocs.io/en/stable/\n\n## Running on GPU\n\nYARN does not have first-class support for GPU resources. A common workaround is\nto use [node labels][node-labels] where CPU-only nodes are unlabelled, while\nthe GPU ones have a label. Furthermore, in this setting GPU nodes are\ntypically bound to a separate queue which is different from the default one.\n\nCurrently, tf-yarn assumes that the GPU label is ``\"gpu\"``. There are no\nassumptions on the name of the queue with GPU nodes; for the sake of\nexample we will use the name ``\"ml-gpu\"``.\n\nThe default behaviour of `run_on_yarn` is to run on CPU-only nodes. In order\nto run on the GPU ones:\n\n1. Set the `queue` argument.\n2. Set `TaskSpec.label` to `NodeLabel.GPU` for relevant task types.\n   A good rule of thumb is to run compute-heavy `\"chief\"` and `\"worker\"`\n   tasks on GPU, while keeping `\"ps\"` and `\"evaluator\"` on CPU.\n3. 
Generate two Python environments: one with Tensorflow for CPUs and one\n   with Tensorflow for GPUs. The `additional_packages` and `ignored_packages`\n   parameters of `upload_env_to_hdfs` are only supported with pex packages.\n\n```python\nfrom tf_yarn import NodeLabel\nfrom tf_yarn import packaging\n\npyenv_zip_path_cpu, _ = packaging.upload_env_to_hdfs()\npyenv_zip_path_gpu, _ = packaging.upload_env_to_hdfs(\n    additional_packages={\"tensorflow-gpu\": \"2.0.0a0\"},\n    ignored_packages={\"tensorflow\"}\n)\nrun_on_yarn(\n    {NodeLabel.CPU: pyenv_zip_path_cpu, NodeLabel.GPU: pyenv_zip_path_gpu},\n    experiment_fn,\n    task_specs={\n        \"chief\": TaskSpec(memory=2 * 2**10, vcores=4, label=NodeLabel.GPU),\n        \"evaluator\": TaskSpec(memory=2**10, vcores=1),\n        \"tensorboard\": TaskSpec(memory=2**10, vcores=1)\n    },\n    queue=\"ml-gpu\"\n)\n```\n\n[node-labels]: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html\n\n## Accessing HDFS in the presence of [federation][federation]\n\n`skein`, the library underlying `tf_yarn`, automatically acquires a delegation token\nfor ``fs.defaultFS`` on security-enabled clusters. This should be enough for most\nuse-cases. However, if your experiment needs to access data on namenodes other than\nthe default one, you have to explicitly list them in the `file_systems` argument\nto `run_on_yarn`. This instructs `skein` to acquire a delegation token for\nthese namenodes in addition to ``fs.defaultFS``:\n\n```python\nrun_on_yarn(\n    ...,\n    file_systems=[\"hdfs://preprod\"]\n)\n```\n\nDepending on the cluster configuration, you might need to point libhdfs to a\ndifferent configuration folder. For instance:\n\n```python\nrun_on_yarn(\n    ...,\n    env={\"HADOOP_CONF_DIR\": \"/etc/hadoop/conf.all\"}\n)\n```\n\n[federation]: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/Federation.html\n\n## Tensorboard\n\nYou can use Tensorboard with tf-yarn.\nTensorboard is automatically spawned when using the default `task_specs`. 
It then runs as a separate container on YARN.\nIf you use custom `task_specs`, you must explicitly add a Tensorboard task to your configuration.\n\n```python\nrun_on_yarn(\n    ...,\n    task_specs={\n        \"chief\": TaskSpec(memory=2 * 2**10, vcores=4),\n        \"worker\": TaskSpec(memory=2 * 2**10, vcores=4, instances=8),\n        \"ps\": TaskSpec(memory=2 * 2**10, vcores=8),\n        \"evaluator\": TaskSpec(memory=2**10, vcores=1),\n        \"tensorboard\": TaskSpec(memory=2**10, vcores=1, instances=1, termination_timeout_seconds=30)\n    }\n)\n```\n\nBoth `instances` and `termination_timeout_seconds` are optional parameters.\n* instances: controls the number of Tensorboard instances to spawn. Defaults to 1.\n* termination_timeout_seconds: controls how many seconds each Tensorboard instance must stay alive after the end of the run. Defaults to 30 seconds.\n\nThe full access URL of each Tensorboard instance is advertised as a _url_event_ starting with \"Tensorboard is listening at...\".\nTypically, you will see it appear on the standard output of a _run_on_yarn_ call.\n\n### Environment variables\nThe following optional environment variables can be passed to the Tensorboard task:\n* TF_BOARD_MODEL_DIR: to configure a model directory. Note that the experiment model dir, if specified, has higher priority. 
Defaults: None\n* TF_BOARD_EXTRA_ARGS: appends command line arguments to the mandatory ones (--logdir and --port): defaults: None \n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/criteo/tf-yarn", "keywords": "tensorflow yarn", "license": "", "maintainer": "Criteo", "maintainer_email": "github@criteo.com", "name": "tf-yarn", "package_url": "https://pypi.org/project/tf-yarn/", "platform": "", "project_url": "https://pypi.org/project/tf-yarn/", "project_urls": { "Homepage": "https://github.com/criteo/tf-yarn" }, "release_url": "https://pypi.org/project/tf-yarn/0.4.3/", "requires_dist": [ "pex (==1.6.7)", "tensorflow (==1.12.2)", "conda-pack", "cloudpickle (==1.0.0)", "skein (==0.7.2)", "pandas" ], "requires_python": ">=3.6", "summary": "Distributed TensorFlow on a YARN cluster", "version": "0.4.3" }, "last_serial": 5519261, "releases": { "0.2.2": [ { "comment_text": "", "digests": { "md5": "0c73047a959d8d070930e39d5574c4eb", "sha256": "fb89c67c663895e83f51e87ef966ff305aa292e0c81a29ffad3a656ac05d28a8" }, "downloads": -1, "filename": "tf_yarn-0.2.2-py3-none-any.whl", "has_sig": false, "md5_digest": "0c73047a959d8d070930e39d5574c4eb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 29280, "upload_time": "2019-04-12T13:15:02", "url": "https://files.pythonhosted.org/packages/83/0d/ef28400d27d7bca870b09f05a64f1b3d91d2e5f3e13dd038dc3af33d6be1/tf_yarn-0.2.2-py3-none-any.whl" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "1e1dedc7b9d0d578a01357dc3260e605", "sha256": "ebdefe6341c17bb8956059236acb1a5e03489bb76cfae96cce8507bee1117e90" }, "downloads": -1, "filename": "tf_yarn-0.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "1e1dedc7b9d0d578a01357dc3260e605", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 33019, "upload_time": 
"2019-05-22T14:46:55", "url": "https://files.pythonhosted.org/packages/45/8b/43b5bd9b34504b11f1368fc036f19b9d2799b39a9c41a79d099b63147ee0/tf_yarn-0.3.2-py3-none-any.whl" } ], "0.4.2": [ { "comment_text": "", "digests": { "md5": "96614bfc17c101828c082fc90c228c6f", "sha256": "3de1b2f362710a3f428f92eafdcc68f1dde96beb078d5d9f42f0f65df8490197" }, "downloads": -1, "filename": "tf_yarn-0.4.2-py3-none-any.whl", "has_sig": false, "md5_digest": "96614bfc17c101828c082fc90c228c6f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 34750, "upload_time": "2019-07-08T16:09:03", "url": "https://files.pythonhosted.org/packages/61/96/17b4ee7a51eb01591fcd9d233aef97dfce0d2b6b0665214a83ec140b2a1d/tf_yarn-0.4.2-py3-none-any.whl" } ], "0.4.3": [ { "comment_text": "", "digests": { "md5": "ebc49efb41171897d206461a460d9207", "sha256": "ae304741eb9ebfdb649fe6527c2a666bda342a653b34091f1024a3529d7e9173" }, "downloads": -1, "filename": "tf_yarn-0.4.3-py3-none-any.whl", "has_sig": false, "md5_digest": "ebc49efb41171897d206461a460d9207", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 37030, "upload_time": "2019-07-11T17:24:33", "url": "https://files.pythonhosted.org/packages/19/1c/870c39da390788a0c791f738fe86172d58d3210535fd477f6c9952751271/tf_yarn-0.4.3-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ebc49efb41171897d206461a460d9207", "sha256": "ae304741eb9ebfdb649fe6527c2a666bda342a653b34091f1024a3529d7e9173" }, "downloads": -1, "filename": "tf_yarn-0.4.3-py3-none-any.whl", "has_sig": false, "md5_digest": "ebc49efb41171897d206461a460d9207", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 37030, "upload_time": "2019-07-11T17:24:33", "url": "https://files.pythonhosted.org/packages/19/1c/870c39da390788a0c791f738fe86172d58d3210535fd477f6c9952751271/tf_yarn-0.4.3-py3-none-any.whl" } ] }